Case: this behavioral study will be handed over to the marketing team, who will use it to target customer groups and personalize each customer's experience accordingly.
The solution is based on customers' buying patterns, for Gifter & Self-Gifter segmentation.
To segment the customer base into homogeneous groups, RFM (Recency, Frequency, Monetary) analysis is used.
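As a preview of the scoring step, here is a minimal RFM sketch on a toy purchase log; the column names and dates below are illustrative, not the project's actual schema:

```python
import pandas as pd

# Toy purchase log: one row per order (illustrative columns)
orders = pd.DataFrame({
    'customer': ['a', 'a', 'b', 'c', 'c', 'c'],
    'order_date': pd.to_datetime(['2022-01-05', '2022-06-01', '2021-12-20',
                                  '2022-05-10', '2022-06-15', '2022-07-01']),
    'amount': [120.0, 80.0, 300.0, 50.0, 60.0, 40.0],
})

snapshot = pd.Timestamp('2022-08-01')  # "today" used to measure recency

# One row per customer: days since last order, order count, total spend
rfm = orders.groupby('customer').agg(
    recency=('order_date', lambda d: (snapshot - d.max()).days),
    frequency=('order_date', 'count'),
    monetary=('amount', 'sum'),
)
```

In a full analysis each of the three columns would then be binned into scores (e.g. quartile ranks) and customers grouped by their combined R-F-M score.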
# Importing required libraries for analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Reading the data we have
sales_df = pd.read_csv(r'C:\Users\affine\Downloads\Mock Project\Data\sj_q1.csv',low_memory = False) #Sales Data
email_df = pd.read_csv(r'C:\Users\affine\Downloads\Mock Project\Data\sj_q2.csv') #Email Address Data
customer_df = pd.read_csv(r'C:\Users\affine\Downloads\Mock Project\Data\sj_q3.csv') #Customer Data
product_df = pd.read_csv(r'C:\Users\affine\Downloads\Mock Project\Data\sj_q4.csv') #Product Data
address_df = pd.read_csv(r'C:\Users\affine\Downloads\Mock Project\Data\sj_q5.csv') #Address Data
# Setting low_memory = False avoids a DtypeWarning for columns with mixed datatypes
address_df.shape
(1048575, 5)
# Checking if email address key is null for any order as it represents a customer
sales_df['email_address_key'].isnull().values.any()
False
There are no null values in the Email Address Key column, which identifies the customer.
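The same check can be extended to every column at once with a per-column null count; a generic sketch on a toy frame (not the project data):

```python
import numpy as np
import pandas as pd

# Toy frame with one fully populated column and one sparse column
df = pd.DataFrame({
    'email_address_key': [1, 2, 3],
    'return_date': [np.nan, '2021-01-05', np.nan],
})

# Null count per column, returned as a Series indexed by column name
null_counts = df.isnull().sum()
```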
# Checking the sales data shape to know the number of rows and columns it has
sales_df.shape
(808460, 29)
# Getting info of how many customers we have in total
sales_df.email_address_key.nunique()
454317
# Checking if there are any duplicates in sales data as it's our primary datasource
duplicates = sales_df[sales_df.duplicated()].sort_values('email_address_key',ascending=True)
duplicates
| merch_category_rollup | merch_sub_category_rollup | MERCH_PRODUCT_CATEGORY | email_address_key | basket_id | original_basket_id | order_type | site | basket_start_date | submit_date | ... | local_currency_after_disc_sale_amount | usd_after_disc_sale_amount | usd_after_disc_cost_amount | usd_exchange_credit_amount | quantity | DIAMOND_SKU_1 | DIAMOND_SKU_2 | DIAMOND_SKU_4 | DIAMOND_SKU_5 | OFFER_ID | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 332453 | Other Jewelry | Misc | Misc | 1227859 | 87307460 | 87307460 | STANDARD | BN | 21-04-2021 | 21-04-2021 | ... | 27.0 | 27.00 | 0.0000 | 0.0 | 0 | NaN | NaN | NaN | NaN | 10478.0 |
| 331935 | Other Jewelry | Misc | Misc | 1227859 | 48610624 | 48610624 | STANDARD | BN | 03-04-2021 | 02-04-2021 | ... | 27.0 | 27.00 | 0.0000 | 0.0 | 0 | NaN | NaN | NaN | NaN | 10478.0 |
| 657035 | Other Jewelry | Metals | Silver | 1264396 | 34050926 | 34050926 | STANDARD | BN | 03-06-2021 | 03-06-2021 | ... | 87.5 | 87.50 | 35.3208 | 0.0 | 1 | NaN | NaN | NaN | NaN | 44984.0 |
| 269593 | Other Jewelry | Metals | Silver | 1277299 | 94672258 | 94672258 | STANDARD | BN | 10-12-2020 | 10-12-2020 | ... | 75.0 | 75.00 | 30.0294 | 0.0 | 1 | NaN | NaN | NaN | NaN | 18138.0 |
| 285034 | Other Jewelry | Misc | Misc | 1277299 | 94672258 | 94672258 | STANDARD | BN | 10-12-2020 | 10-12-2020 | ... | 6.0 | 6.00 | 0.0000 | 0.0 | 0 | NaN | NaN | NaN | NaN | 10476.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 304940 | Other Jewelry | Misc | Misc | 31489201 | 19676740 | 19676740 | STANDARD | BN | 02-08-2022 | 02-08-2022 | ... | 30.0 | 30.00 | 0.0000 | 0.0 | 0 | NaN | NaN | NaN | NaN | 10478.0 |
| 315911 | Other Jewelry | Misc | Misc | 31489327 | 59724594 | 59724594 | STANDARD | BN | 31-07-2022 | 02-08-2022 | ... | 30.0 | 30.00 | 0.0000 | 0.0 | 0 | NaN | NaN | NaN | NaN | 10478.0 |
| 313273 | Other Jewelry | Misc | Misc | 31489437 | 69892179 | 69892179 | STANDARD | BNCA | 02-08-2022 | 02-08-2022 | ... | 43.0 | 33.66 | 0.0000 | 0.0 | 0 | NaN | NaN | NaN | NaN | 10478.0 |
| 314112 | Other Jewelry | Misc | Misc | 31490510 | 60739028 | 60739028 | STANDARD | BN | 03-08-2022 | 03-08-2022 | ... | 30.0 | 30.00 | 0.0000 | 0.0 | 0 | NaN | NaN | NaN | NaN | 10478.0 |
| 313222 | Other Jewelry | Misc | Misc | 31490614 | 87155101 | 87155101 | STANDARD | BN | 03-08-2022 | 03-08-2022 | ... | 30.0 | 30.00 | 0.0000 | 0.0 | 0 | NaN | NaN | NaN | NaN | 10478.0 |
7558 rows × 29 columns
We have 7,558 duplicated rows in the dataset.
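By default `drop_duplicates` removes rows that are identical in every column and keeps the first occurrence; the `subset` parameter narrows the comparison to chosen keys. A small sketch on toy data:

```python
import pandas as pd

# Toy data: rows 0 and 1 are fully identical; rows 2 and 3 share a basket_id
df = pd.DataFrame({
    'basket_id': [1, 1, 2, 2],
    'amount': [10, 10, 20, 25],
})

full_dedup = df.drop_duplicates()                     # rows identical in every column
by_basket = df.drop_duplicates(subset=['basket_id'])  # first row per basket_id
```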
# Removing duplicated rows from the sales data
non_dup_df = sales_df.drop_duplicates()
non_dup_df
| merch_category_rollup | merch_sub_category_rollup | MERCH_PRODUCT_CATEGORY | email_address_key | basket_id | original_basket_id | order_type | site | basket_start_date | submit_date | ... | local_currency_after_disc_sale_amount | usd_after_disc_sale_amount | usd_after_disc_cost_amount | usd_exchange_credit_amount | quantity | DIAMOND_SKU_1 | DIAMOND_SKU_2 | DIAMOND_SKU_4 | DIAMOND_SKU_5 | OFFER_ID | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | 16983312 | 81947282 | 81947282 | STANDARD | BN | 19-09-2020 | 19-09-2020 | ... | 5558.00000 | 5558.0 | 3726.6400 | 0.0 | 1 | LD12713965 | LD12142705 | LD12681121 | LD13880254 | NaN |
| 1 | Engagement | Engagement | BYO Semi-Mounts | 22822442 | 59639029 | 75113342 | EXCHANGE | BN | 21-09-2021 | 21-09-2021 | ... | 2149.00000 | 2149.0 | 1219.2000 | 2149.0 | 1 | LD14946369 | NaN | NaN | NaN | NaN |
| 2 | Engagement | Engagement | BYO Semi-Mounts | 19696924 | 23989389 | 23989389 | SPECIAL ORDER | BN | 06-06-2021 | 07-06-2021 | ... | 5621.00000 | 5621.0 | 3868.9700 | 0.0 | 1 | LD16111745 | NaN | NaN | NaN | NaN |
| 3 | Engagement | Engagement | BYO Semi-Mounts | 24120678 | 12758238 | 12758238 | STANDARD | BN | 20-09-2021 | 20-09-2021 | ... | 2759.00000 | 2759.0 | 1776.2600 | 0.0 | 1 | LD16655615 | NaN | NaN | NaN | NaN |
| 4 | Engagement | Engagement | BYO Semi-Mounts | 23536382 | 84045443 | 84045443 | SPECIAL ORDER | BN | 21-12-2020 | 21-12-2020 | ... | 2622.00000 | 2622.0 | 1305.7100 | 0.0 | 1 | LD14659784 | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 808455 | Diamond Jewelry | Diamond Jewelry Basic BYO | BYO Stud Metal | 23036258 | 22989040 | 22989040 | STANDARD | BN | 15-09-2020 | 15-09-2020 | ... | 2380.00000 | 2380.0 | 1749.0308 | 0.0 | 1 | LD14113460 | LD14252074 | NaN | NaN | NaN |
| 808456 | Diamond Jewelry | Diamond Jewelry Basic BYO | BYO Stud Metal | 10708737 | 78485771 | 78485771 | STANDARD | BN | 16-02-2021 | 16-02-2021 | ... | 2357.00001 | 2357.0 | 1766.8096 | 0.0 | 1 | LD14409557 | LD14407739 | NaN | NaN | NaN |
| 808457 | Engagement | Engagement | BYO 3-Stone Metal | 19190442 | 64237203 | 64237203 | STANDARD | BN | 08-10-2020 | 08-10-2020 | ... | 6235.00001 | 6235.0 | 4448.4775 | 0.0 | 1 | LD14098852 | LD12868926 | NaN | NaN | NaN |
| 808458 | Diamond Jewelry | Diamond Jewelry Basic BYO | BYO Stud Metal | 18101292 | 51603645 | 51603645 | STANDARD | BN | 22-08-2020 | 22-08-2020 | ... | 3544.99999 | 3545.0 | 2768.8646 | 0.0 | 1 | LD14100562 | LD13980487 | NaN | NaN | NaN |
| 808459 | NaN | NaN | NaN | 24634277 | 71136882 | 71136882 | STANDARD | BN | 23-01-2022 | 08-02-2022 | ... | 5300.00001 | 5300.0 | 3258.2825 | 0.0 | 1 | LD17179489 | LD17153838 | LD16411216 | LD16158119 | NaN |
800902 rows × 29 columns
We now have about 8 lakh (8,00,902) distinct order rows.
# Checking if there are same number of customers after removing duplicates
non_dup_df.email_address_key.nunique()
454317
The number of customers before and after removing duplicates should remain the same, and both counts came out to 4,54,317.
Next, we need to remove the orders that were returned.
# Keeping only non-returned orders
non_returns = non_dup_df[non_dup_df['return_date'].isnull()]
non_returns
| merch_category_rollup | merch_sub_category_rollup | MERCH_PRODUCT_CATEGORY | email_address_key | basket_id | original_basket_id | order_type | site | basket_start_date | submit_date | ... | local_currency_after_disc_sale_amount | usd_after_disc_sale_amount | usd_after_disc_cost_amount | usd_exchange_credit_amount | quantity | DIAMOND_SKU_1 | DIAMOND_SKU_2 | DIAMOND_SKU_4 | DIAMOND_SKU_5 | OFFER_ID | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | 16983312 | 81947282 | 81947282 | STANDARD | BN | 19-09-2020 | 19-09-2020 | ... | 5558.00000 | 5558.0 | 3726.6400 | 0.0 | 1 | LD12713965 | LD12142705 | LD12681121 | LD13880254 | NaN |
| 1 | Engagement | Engagement | BYO Semi-Mounts | 22822442 | 59639029 | 75113342 | EXCHANGE | BN | 21-09-2021 | 21-09-2021 | ... | 2149.00000 | 2149.0 | 1219.2000 | 2149.0 | 1 | LD14946369 | NaN | NaN | NaN | NaN |
| 2 | Engagement | Engagement | BYO Semi-Mounts | 19696924 | 23989389 | 23989389 | SPECIAL ORDER | BN | 06-06-2021 | 07-06-2021 | ... | 5621.00000 | 5621.0 | 3868.9700 | 0.0 | 1 | LD16111745 | NaN | NaN | NaN | NaN |
| 3 | Engagement | Engagement | BYO Semi-Mounts | 24120678 | 12758238 | 12758238 | STANDARD | BN | 20-09-2021 | 20-09-2021 | ... | 2759.00000 | 2759.0 | 1776.2600 | 0.0 | 1 | LD16655615 | NaN | NaN | NaN | NaN |
| 4 | Engagement | Engagement | BYO Semi-Mounts | 23536382 | 84045443 | 84045443 | SPECIAL ORDER | BN | 21-12-2020 | 21-12-2020 | ... | 2622.00000 | 2622.0 | 1305.7100 | 0.0 | 1 | LD14659784 | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 808455 | Diamond Jewelry | Diamond Jewelry Basic BYO | BYO Stud Metal | 23036258 | 22989040 | 22989040 | STANDARD | BN | 15-09-2020 | 15-09-2020 | ... | 2380.00000 | 2380.0 | 1749.0308 | 0.0 | 1 | LD14113460 | LD14252074 | NaN | NaN | NaN |
| 808456 | Diamond Jewelry | Diamond Jewelry Basic BYO | BYO Stud Metal | 10708737 | 78485771 | 78485771 | STANDARD | BN | 16-02-2021 | 16-02-2021 | ... | 2357.00001 | 2357.0 | 1766.8096 | 0.0 | 1 | LD14409557 | LD14407739 | NaN | NaN | NaN |
| 808457 | Engagement | Engagement | BYO 3-Stone Metal | 19190442 | 64237203 | 64237203 | STANDARD | BN | 08-10-2020 | 08-10-2020 | ... | 6235.00001 | 6235.0 | 4448.4775 | 0.0 | 1 | LD14098852 | LD12868926 | NaN | NaN | NaN |
| 808458 | Diamond Jewelry | Diamond Jewelry Basic BYO | BYO Stud Metal | 18101292 | 51603645 | 51603645 | STANDARD | BN | 22-08-2020 | 22-08-2020 | ... | 3544.99999 | 3545.0 | 2768.8646 | 0.0 | 1 | LD14100562 | LD13980487 | NaN | NaN | NaN |
| 808459 | NaN | NaN | NaN | 24634277 | 71136882 | 71136882 | STANDARD | BN | 23-01-2022 | 08-02-2022 | ... | 5300.00001 | 5300.0 | 3258.2825 | 0.0 | 1 | LD17179489 | LD17153838 | LD16411216 | LD16158119 | NaN |
677956 rows × 29 columns
We now have 6,77,956 non-returned orders.
# Checking if the above data has only non-return orders
non_returns['return_date'].isnull().sum()
677956
As the number of rows equals the count of null values in the return_date column, the data now contains only non-returned orders.
# Merging all the datasets we have with the non-returned orders dataset
sales_email = pd.merge(non_returns, email_df, on = 'email_address_key', how = 'left')
se_cust = pd.merge(sales_email,customer_df.drop_duplicates(['email_address']), left_on = 'email_Address',
right_on = 'email_address', how = 'left')
sec_prod = pd.merge(se_cust, product_df, left_on ='OFFER_ID', right_on = 'OFFER_KEY', how='left')
final_merged = pd.merge(sec_prod,address_df.drop_duplicates(['basket_id']), left_on=['basket_id'],right_on=['basket_id'],
how='left')
final_merged
| merch_category_rollup | merch_sub_category_rollup | MERCH_PRODUCT_CATEGORY | email_address_key | basket_id | original_basket_id | order_type | site | basket_start_date | submit_date | ... | derived_gender | age | OFFER_KEY_x | name | TARGET_GENDER | display_type | PRODUCT_KEY | OFFER_KEY_y | BILL_TO_ADDRESS_KEY | SHIP_TO_ADDRESS_KEY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | 16983312 | 81947282 | 81947282 | STANDARD | BN | 19-09-2020 | 19-09-2020 | ... | M | 55.0 | NaN | NaN | NaN | NaN | 2551021.0 | 62363.0 | 10474539.0 | 10474540.0 |
| 1 | Engagement | Engagement | BYO Semi-Mounts | 22822442 | 59639029 | 75113342 | EXCHANGE | BN | 21-09-2021 | 21-09-2021 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 3533894.0 | 76208.0 | 11277556.0 | 11030807.0 |
| 2 | Engagement | Engagement | BYO Semi-Mounts | 19696924 | 23989389 | 23989389 | SPECIAL ORDER | BN | 06-06-2021 | 07-06-2021 | ... | M | 31.0 | NaN | NaN | NaN | NaN | 3565789.0 | 76212.0 | 11129079.0 | 11129080.0 |
| 3 | Engagement | Engagement | BYO Semi-Mounts | 24120678 | 12758238 | 12758238 | STANDARD | BN | 20-09-2021 | 20-09-2021 | ... | M | 37.0 | NaN | NaN | NaN | NaN | 3601472.0 | 76207.0 | 11329644.0 | 11178488.0 |
| 4 | Engagement | Engagement | BYO Semi-Mounts | 23536382 | 84045443 | 84045443 | SPECIAL ORDER | BN | 21-12-2020 | 21-12-2020 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 3468286.0 | 76897.0 | 10759761.0 | 10759762.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 677951 | Diamond Jewelry | Diamond Jewelry Basic BYO | BYO Stud Metal | 23036258 | 22989040 | 22989040 | STANDARD | BN | 15-09-2020 | 15-09-2020 | ... | M | NaN | NaN | NaN | NaN | NaN | 2058088.0 | 51574.0 | 10466130.0 | 10466130.0 |
| 677952 | Diamond Jewelry | Diamond Jewelry Basic BYO | BYO Stud Metal | 10708737 | 78485771 | 78485771 | STANDARD | BN | 16-02-2021 | 16-02-2021 | ... | M | 39.0 | NaN | NaN | NaN | NaN | 3031615.0 | 50614.0 | 10893648.0 | 10893649.0 |
| 677953 | Engagement | Engagement | BYO 3-Stone Metal | 19190442 | 64237203 | 64237203 | STANDARD | BN | 08-10-2020 | 08-10-2020 | ... | M | 60.0 | NaN | NaN | NaN | NaN | 2098637.0 | 49876.0 | 10504692.0 | 10504692.0 |
| 677954 | Diamond Jewelry | Diamond Jewelry Basic BYO | BYO Stud Metal | 18101292 | 51603645 | 51603645 | STANDARD | BN | 22-08-2020 | 22-08-2020 | ... | M | NaN | NaN | NaN | NaN | NaN | 2058088.0 | 51574.0 | 10422395.0 | 10412311.0 |
| 677955 | NaN | NaN | NaN | 24634277 | 71136882 | 71136882 | STANDARD | BN | 23-01-2022 | 08-02-2022 | ... | F | NaN | NaN | NaN | NaN | NaN | 2551032.0 | 62363.0 | 11698964.0 | 11698965.0 |
677956 rows × 43 columns
The non-returned orders and the final merged data have the same number of rows, i.e., 6,77,956.
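When chaining several left joins like this, pandas' `indicator=True` flag is a handy way to confirm how many rows actually found a match on the other side. A minimal sketch on toy frames (the column names are illustrative):

```python
import pandas as pd

left = pd.DataFrame({'email_address_key': [1, 2, 3]})
right = pd.DataFrame({'email_address_key': [1, 2],
                      'email_Address': ['a@x.com', 'b@x.com']})

merged = pd.merge(left, right, on='email_address_key',
                  how='left', indicator=True)

# _merge is 'both' for matched rows and 'left_only' for unmatched ones
match_counts = merged['_merge'].value_counts()
```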
# Checking if there are customers whose email address is not in our dataset
pd.DataFrame(final_merged, columns=['email_address_key', 'email_Address']).nunique()
email_address_key    417027
email_Address        402782
dtype: int64
As the Email Address Key and Email Address counts don't match, some customers' email addresses are missing from the dataset.
We need to drop those customers' orders, since we can't run marketing campaigns or any other outreach without knowing a customer's email address.
# Dropping the rows that have null values in the email address column
# (.copy() avoids a SettingWithCopyWarning on later assignments)
final_merged = final_merged[final_merged['email_Address'].notna()].copy()
final_merged
| merch_category_rollup | merch_sub_category_rollup | MERCH_PRODUCT_CATEGORY | email_address_key | basket_id | original_basket_id | order_type | site | basket_start_date | submit_date | ... | derived_gender | age | OFFER_KEY_x | name | TARGET_GENDER | display_type | PRODUCT_KEY | OFFER_KEY_y | BILL_TO_ADDRESS_KEY | SHIP_TO_ADDRESS_KEY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | 16983312 | 81947282 | 81947282 | STANDARD | BN | 19-09-2020 | 19-09-2020 | ... | M | 55.0 | NaN | NaN | NaN | NaN | 2551021.0 | 62363.0 | 10474539.0 | 10474540.0 |
| 2 | Engagement | Engagement | BYO Semi-Mounts | 19696924 | 23989389 | 23989389 | SPECIAL ORDER | BN | 06-06-2021 | 07-06-2021 | ... | M | 31.0 | NaN | NaN | NaN | NaN | 3565789.0 | 76212.0 | 11129079.0 | 11129080.0 |
| 3 | Engagement | Engagement | BYO Semi-Mounts | 24120678 | 12758238 | 12758238 | STANDARD | BN | 20-09-2021 | 20-09-2021 | ... | M | 37.0 | NaN | NaN | NaN | NaN | 3601472.0 | 76207.0 | 11329644.0 | 11178488.0 |
| 5 | Engagement | Engagement | BYO Semi-Mounts | 24092029 | 18812838 | 18812838 | STANDARD | BN | 11-07-2021 | 18-08-2021 | ... | M | 34.0 | NaN | NaN | NaN | NaN | 2604603.0 | 60452.0 | 11264941.0 | 11264941.0 |
| 6 | Engagement | Engagement | BYO Semi-Mounts | 24079756 | 47744213 | 47744213 | STANDARD | BN | 09-06-2021 | 09-06-2021 | ... | M | NaN | NaN | NaN | NaN | NaN | 2888321.0 | 63046.0 | 11137487.0 | 11135306.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 677951 | Diamond Jewelry | Diamond Jewelry Basic BYO | BYO Stud Metal | 23036258 | 22989040 | 22989040 | STANDARD | BN | 15-09-2020 | 15-09-2020 | ... | M | NaN | NaN | NaN | NaN | NaN | 2058088.0 | 51574.0 | 10466130.0 | 10466130.0 |
| 677952 | Diamond Jewelry | Diamond Jewelry Basic BYO | BYO Stud Metal | 10708737 | 78485771 | 78485771 | STANDARD | BN | 16-02-2021 | 16-02-2021 | ... | M | 39.0 | NaN | NaN | NaN | NaN | 3031615.0 | 50614.0 | 10893648.0 | 10893649.0 |
| 677953 | Engagement | Engagement | BYO 3-Stone Metal | 19190442 | 64237203 | 64237203 | STANDARD | BN | 08-10-2020 | 08-10-2020 | ... | M | 60.0 | NaN | NaN | NaN | NaN | 2098637.0 | 49876.0 | 10504692.0 | 10504692.0 |
| 677954 | Diamond Jewelry | Diamond Jewelry Basic BYO | BYO Stud Metal | 18101292 | 51603645 | 51603645 | STANDARD | BN | 22-08-2020 | 22-08-2020 | ... | M | NaN | NaN | NaN | NaN | NaN | 2058088.0 | 51574.0 | 10422395.0 | 10412311.0 |
| 677955 | NaN | NaN | NaN | 24634277 | 71136882 | 71136882 | STANDARD | BN | 23-01-2022 | 08-02-2022 | ... | F | NaN | NaN | NaN | NaN | NaN | 2551032.0 | 62363.0 | 11698964.0 | 11698965.0 |
655209 rows × 43 columns
After dropping those rows, we have 6,55,209 orders.
# Checking whether all the rows without an email address were removed or not
pd.DataFrame(final_merged, columns=['email_address_key', 'email_Address']).nunique()
email_address_key    402782
email_Address        402782
dtype: int64
As the count of both columns is equal, all those rows were dropped and now we have 4,02,782 distinct customers.
# Pie chart representation of Distribution of Customers and Total Orders over different Categories
cust_dist = final_merged.groupby('merch_category_rollup').agg({'merch_category_rollup':'first', 'email_address_key':'nunique'})
order_dist = final_merged.groupby('merch_category_rollup').agg({'merch_category_rollup':'first', 'basket_id':'nunique'})
a, b = cust_dist.index, order_dist.index
c, d = cust_dist['email_address_key'], order_dist['basket_id']
fig = plt.figure(figsize=(18,10), dpi=1600)
# Customer Distribution Plot
ax1 = plt.subplot2grid((2,2),(0,0))
plt.pie(c, labels = a, radius = 1.5,textprops = {"fontsize":15}, autopct = "%1.1f%%",
wedgeprops = {"edgecolor":"black",'linewidth':0.5})
plt.title('Customers Distribution', y = 1.15)
# Order Distribution Plot
ax2 = plt.subplot2grid((2,2), (0,1))
plt.pie(d, labels = b, radius = 1.5,textprops = {"fontsize":15}, autopct = "%1.1f%%",
wedgeprops = {"edgecolor":"black",'linewidth':0.5})
plt.title('Orders Distribution', y = 1.15)
plt.show()
# Checking data types of columns in the data
final_merged.dtypes
merch_category_rollup                             object
merch_sub_category_rollup                         object
MERCH_PRODUCT_CATEGORY                            object
email_address_key                                  int64
basket_id                                          int64
original_basket_id                                 int64
order_type                                        object
site                                              object
basket_start_date                                 object
submit_date                                       object
order_date                                        object
ship_date                                         object
ship_delivery_date                                object
return_date                                       object
discount_name                                     object
local_currency                                    object
promo_discount_applied_flag                       object
local_currency_before_disc_sale_amount           float64
discount_promo_amount                            float64
local_currency_after_disc_sale_amount            float64
usd_after_disc_sale_amount                       float64
usd_after_disc_cost_amount                       float64
usd_exchange_credit_amount                       float64
quantity                                           int64
DIAMOND_SKU_1                                     object
DIAMOND_SKU_2                                     object
DIAMOND_SKU_4                                     object
DIAMOND_SKU_5                                     object
OFFER_ID                                         float64
email_Address                                     object
email_address                                     object
first_name                                        object
last_name                                         object
derived_gender                                    object
age                                              float64
OFFER_KEY_x                                      float64
name                                              object
TARGET_GENDER                                     object
display_type                                      object
PRODUCT_KEY                                      float64
OFFER_KEY_y                                      float64
BILL_TO_ADDRESS_KEY                              float64
SHIP_TO_ADDRESS_KEY                              float64
dtype: object
# Filling na values of shipping, billing address columns with -1 to preserve the int datatype and to distinguish null values
final_merged[['SHIP_TO_ADDRESS_KEY','BILL_TO_ADDRESS_KEY']] = final_merged[['SHIP_TO_ADDRESS_KEY','BILL_TO_ADDRESS_KEY']].fillna(-1)
# Changing needed columns dtypes
final_merged = final_merged.astype({"SHIP_TO_ADDRESS_KEY" : int, "BILL_TO_ADDRESS_KEY" : int,
"email_address_key" : object, "basket_id" : object})
final_merged["submit_date"] = pd.to_datetime(final_merged["submit_date"], dayfirst = True)
final_merged["order_date"] = pd.to_datetime(final_merged["order_date"], dayfirst = True)
# Checking for dtypes after change
pd.DataFrame(final_merged, columns=['SHIP_TO_ADDRESS_KEY','BILL_TO_ADDRESS_KEY','basket_id','submit_date','order_date']).dtypes
SHIP_TO_ADDRESS_KEY             int32
BILL_TO_ADDRESS_KEY             int32
basket_id                      object
submit_date            datetime64[ns]
order_date             datetime64[ns]
dtype: object
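An alternative to the -1 sentinel is pandas' nullable `Int64` dtype (capital I), which keeps the keys integer while preserving genuine missing values as `<NA>`. A sketch on toy values, not the project code:

```python
import numpy as np
import pandas as pd

# Address keys read as float64 because of the NaN
keys = pd.Series([10474539.0, np.nan, 11698965.0])

# 'Int64' is pandas' nullable integer dtype; NaN is preserved as <NA>
nullable_keys = keys.astype('Int64')
```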
# Getting shape of our final jewelry dataset
final_merged.shape
(655209, 43)
list(final_merged.columns)
['merch_category_rollup', 'merch_sub_category_rollup', 'MERCH_PRODUCT_CATEGORY', 'email_address_key', 'basket_id', 'original_basket_id', 'order_type', 'site', 'basket_start_date', 'submit_date', 'order_date', 'ship_date', 'ship_delivery_date', 'return_date', 'discount_name', 'local_currency', 'promo_discount_applied_flag', 'local_currency_before_disc_sale_amount', 'discount_promo_amount', 'local_currency_after_disc_sale_amount', 'usd_after_disc_sale_amount', 'usd_after_disc_cost_amount', 'usd_exchange_credit_amount', 'quantity', 'DIAMOND_SKU_1', 'DIAMOND_SKU_2', 'DIAMOND_SKU_4', 'DIAMOND_SKU_5', 'OFFER_ID', 'email_Address', 'email_address', 'first_name', 'last_name', 'derived_gender', 'age', 'OFFER_KEY_x', 'name', 'TARGET_GENDER', 'display_type', 'PRODUCT_KEY', 'OFFER_KEY_y', 'BILL_TO_ADDRESS_KEY', 'SHIP_TO_ADDRESS_KEY']
# Finding how many customers we have
final_merged.email_address_key.nunique()
402782
The total number of unique customers are 4,02,782.
# Getting data of final_merged
final_merged.head()
| merch_category_rollup | merch_sub_category_rollup | MERCH_PRODUCT_CATEGORY | email_address_key | basket_id | original_basket_id | order_type | site | basket_start_date | submit_date | ... | derived_gender | age | OFFER_KEY_x | name | TARGET_GENDER | display_type | PRODUCT_KEY | OFFER_KEY_y | BILL_TO_ADDRESS_KEY | SHIP_TO_ADDRESS_KEY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | 16983312 | 81947282 | 81947282 | STANDARD | BN | 19-09-2020 | 2020-09-19 | ... | M | 55.0 | NaN | NaN | NaN | NaN | 2551021.0 | 62363.0 | 10474539 | 10474540 |
| 2 | Engagement | Engagement | BYO Semi-Mounts | 19696924 | 23989389 | 23989389 | SPECIAL ORDER | BN | 06-06-2021 | 2021-06-07 | ... | M | 31.0 | NaN | NaN | NaN | NaN | 3565789.0 | 76212.0 | 11129079 | 11129080 |
| 3 | Engagement | Engagement | BYO Semi-Mounts | 24120678 | 12758238 | 12758238 | STANDARD | BN | 20-09-2021 | 2021-09-20 | ... | M | 37.0 | NaN | NaN | NaN | NaN | 3601472.0 | 76207.0 | 11329644 | 11178488 |
| 5 | Engagement | Engagement | BYO Semi-Mounts | 24092029 | 18812838 | 18812838 | STANDARD | BN | 11-07-2021 | 2021-08-18 | ... | M | 34.0 | NaN | NaN | NaN | NaN | 2604603.0 | 60452.0 | 11264941 | 11264941 |
| 6 | Engagement | Engagement | BYO Semi-Mounts | 24079756 | 47744213 | 47744213 | STANDARD | BN | 09-06-2021 | 2021-06-09 | ... | M | NaN | NaN | NaN | NaN | NaN | 2888321.0 | 63046.0 | 11137487 | 11135306 |
5 rows × 43 columns
# Creating a Sales Report with necessary columns
report = final_merged.rename(columns = {'derived_gender':'customer_gender'})[['email_address_key','TARGET_GENDER',
'customer_gender','age','basket_id',
'usd_after_disc_sale_amount']]
report.head()
| email_address_key | TARGET_GENDER | customer_gender | age | basket_id | usd_after_disc_sale_amount | |
|---|---|---|---|---|---|---|
| 0 | 16983312 | NaN | M | 55.0 | 81947282 | 5558.0 |
| 2 | 19696924 | NaN | M | 31.0 | 23989389 | 5621.0 |
| 3 | 24120678 | NaN | M | 37.0 | 12758238 | 2759.0 |
| 5 | 24092029 | NaN | M | 34.0 | 18812838 | 3178.0 |
| 6 | 24079756 | NaN | M | NaN | 47744213 | 1668.0 |
# Grouping orders and sales for each customer
col = {'basket_id':'no_of_orders', 'usd_after_disc_sale_amount':'sum_of_sales'}
report = report.groupby('email_address_key').agg({'customer_gender':'first','age':'first','basket_id':'nunique',
'usd_after_disc_sale_amount':'sum'}).rename(columns = col)
# Sorting by email address key for better understanding
report.sort_values(by=['email_address_key'], ascending = True)
| customer_gender | age | no_of_orders | sum_of_sales | |
|---|---|---|---|---|
| email_address_key | ||||
| 692 | F | 46.0 | 1 | 862.50 |
| 959 | F | NaN | 1 | 8172.69 |
| 1084 | M | 47.0 | 1 | 2242.50 |
| 1359 | M | 62.0 | 2 | 1041.00 |
| 9775 | M | NaN | 1 | 3435.00 |
| ... | ... | ... | ... | ... |
| 31490633 | M | NaN | 1 | 3490.50 |
| 31490634 | F | NaN | 1 | 50.54 |
| 31490635 | M | NaN | 1 | 203.00 |
| 31490636 | F | NaN | 1 | 1032.00 |
| 31490637 | M | NaN | 1 | 160.00 |
402782 rows × 4 columns
This is the sales report for the orders: for all 4,02,782 customers it shows the number of orders the customer has made so far and the customer's total spend, along with gender and age for an overall picture.
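The same report can also be built with pandas' named-aggregation syntax, which names the output columns directly instead of renaming afterwards. A sketch on toy rows shaped like the report's input:

```python
import pandas as pd

toy = pd.DataFrame({
    'email_address_key': [692, 692, 959],
    'basket_id': [11, 11, 22],
    'usd_after_disc_sale_amount': [500.0, 362.5, 8172.69],
})

# Named aggregation: output column = (input column, aggregation function)
report = toy.groupby('email_address_key').agg(
    no_of_orders=('basket_id', 'nunique'),
    sum_of_sales=('usd_after_disc_sale_amount', 'sum'),
)
```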
# Checking unique count of orders
report.no_of_orders.unique()
array([ 1, 2, 3, 9, 14, 4, 6, 7, 5, 8, 10,
13, 11, 12, 18, 16, 20, 17, 15, 21, 24, 27,
121, 30, 25, 31, 403, 33, 22, 23, 29, 3602, 32,
28, 38, 43, 19, 395, 26, 46, 73, 34, 65, 58,
722, 100], dtype=int64)
The number of orders per customer ranges from 1 to 3,602.
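A count of 3,602 orders for a single key is far outside the typical range and may indicate a reseller or shared account. Such accounts can be flagged with a quantile cutoff before segmenting; a generic sketch (an illustration, not part of the original analysis):

```python
import pandas as pd

# Toy order counts with one extreme value
order_counts = pd.Series([1, 1, 2, 1, 3, 2, 1, 3602], name='no_of_orders')

cutoff = order_counts.quantile(0.99)            # 99th-percentile order count
outliers = order_counts[order_counts > cutoff]  # candidates for manual review
```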
# Creating a customer_type column (1 = repeat customer, 0 = one-time customer)
report['customer_type'] = np.where(report['no_of_orders'] > 1, 1, 0)
report.sort_values('sum_of_sales', ascending = False, inplace = True)
report.head()
| customer_gender | age | no_of_orders | sum_of_sales | customer_type | |
|---|---|---|---|---|---|
| email_address_key | |||||
| 19666108 | None | NaN | 3602 | 5989330.96 | 1 |
| 18082833 | None | NaN | 403 | 2267026.85 | 1 |
| 24277775 | M | NaN | 1 | 1000000.00 | 0 |
| 23810664 | F | NaN | 722 | 910152.19 | 1 |
| 24668292 | F | NaN | 2 | 818452.46 | 1 |
# Checking unique values in the customer type column
report.customer_type.unique()
array([1, 0])
Customer Type Column Info
# Grouping customer type
a = (report['customer_type'].value_counts(normalize=True) * 100).round(decimals=2)
b = report['customer_type'].value_counts()
a, b
(0    82.45
1    17.55
Name: customer_type, dtype: float64,
0    332085
1     70697
Name: customer_type, dtype: int64)
We have 3,32,085 one-time customers (82.45%) and 70,697 repeat customers (17.55%).
# Only keeping the Jewelry category data in non-returns final merged data
jewelry_df = final_merged[final_merged['merch_category_rollup'].isin(['Diamond Jewelry', 'Other Jewelry', 'Bands'])]
jewelry_df
| merch_category_rollup | merch_sub_category_rollup | MERCH_PRODUCT_CATEGORY | email_address_key | basket_id | original_basket_id | order_type | site | basket_start_date | submit_date | ... | derived_gender | age | OFFER_KEY_x | name | TARGET_GENDER | display_type | PRODUCT_KEY | OFFER_KEY_y | BILL_TO_ADDRESS_KEY | SHIP_TO_ADDRESS_KEY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 41 | Diamond Jewelry | Diamond Jewelry Basic BYO | BYO Stud Metal | 31371922 | 45832107 | 45832107 | STANDARD | BN | 30-04-2022 | 2022-04-30 | ... | F | NaN | NaN | NaN | NaN | NaN | 3697562.0 | 50619.0 | 11855742 | 11855742 |
| 42 | Diamond Jewelry | Diamond Jewelry Basic BYO | BYO Stud Metal | 12192824 | 26092410 | 26092410 | STANDARD | BN | 11-04-2022 | 2022-04-11 | ... | M | 41.0 | NaN | NaN | NaN | NaN | 2058073.0 | 51584.0 | 7867151 | 11817631 |
| 43 | Diamond Jewelry | Diamond Jewelry Basic BYO | BYO Stud Metal | 22167265 | 15123721 | 15123721 | STANDARD | BN | 18-10-2021 | 2021-10-18 | ... | M | NaN | NaN | NaN | NaN | NaN | 3240108.0 | 50673.0 | 11384809 | 11384810 |
| 44 | Diamond Jewelry | Diamond Jewelry Basic BYO | BYO Stud Metal | 23403248 | 28743548 | 28743548 | STANDARD | BN | 30-11-2020 | 2020-11-30 | ... | M | NaN | NaN | NaN | NaN | NaN | 3031577.0 | 50618.0 | 10642691 | 10642692 |
| 45 | Diamond Jewelry | Diamond Jewelry Basic BYO | BYO Stud Metal | 1590647 | 30915560 | 30915560 | STANDARD | BN | 21-02-2021 | 2021-02-21 | ... | F | NaN | NaN | NaN | NaN | NaN | 2058089.0 | 51571.0 | 10901412 | 10901412 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 677949 | Diamond Jewelry | Diamond Jewelry Basic BYO | BYO Stud Metal | 3740508 | 48317500 | 48317500 | STANDARD | BN | 28-08-2020 | 2020-08-28 | ... | F | NaN | NaN | NaN | NaN | NaN | 2058073.0 | 51584.0 | 10435269 | 10435269 |
| 677950 | Diamond Jewelry | Diamond Jewelry Basic BYO | BYO Stud Metal | 22859237 | 64797963 | 64797963 | STANDARD | BN | 10-08-2020 | 2020-08-10 | ... | M | 42.0 | NaN | NaN | NaN | NaN | 2058088.0 | 51574.0 | 10399071 | 10399071 |
| 677951 | Diamond Jewelry | Diamond Jewelry Basic BYO | BYO Stud Metal | 23036258 | 22989040 | 22989040 | STANDARD | BN | 15-09-2020 | 2020-09-15 | ... | M | NaN | NaN | NaN | NaN | NaN | 2058088.0 | 51574.0 | 10466130 | 10466130 |
| 677952 | Diamond Jewelry | Diamond Jewelry Basic BYO | BYO Stud Metal | 10708737 | 78485771 | 78485771 | STANDARD | BN | 16-02-2021 | 2021-02-16 | ... | M | 39.0 | NaN | NaN | NaN | NaN | 3031615.0 | 50614.0 | 10893648 | 10893649 |
| 677954 | Diamond Jewelry | Diamond Jewelry Basic BYO | BYO Stud Metal | 18101292 | 51603645 | 51603645 | STANDARD | BN | 22-08-2020 | 2020-08-22 | ... | M | NaN | NaN | NaN | NaN | NaN | 2058088.0 | 51574.0 | 10422395 | 10412311 |
545821 rows × 43 columns
The jewelry data has 5,45,821 orders.
# Checking that the remaining rows belong only to those 3 categories
jewelry_df.merch_category_rollup.unique()
array(['Diamond Jewelry', 'Bands', 'Other Jewelry'], dtype=object)
# Columns in the jewelry data
list(jewelry_df.columns)
['merch_category_rollup', 'merch_sub_category_rollup', 'MERCH_PRODUCT_CATEGORY', 'email_address_key', 'basket_id', 'original_basket_id', 'order_type', 'site', 'basket_start_date', 'submit_date', 'order_date', 'ship_date', 'ship_delivery_date', 'return_date', 'discount_name', 'local_currency', 'promo_discount_applied_flag', 'local_currency_before_disc_sale_amount', 'discount_promo_amount', 'local_currency_after_disc_sale_amount', 'usd_after_disc_sale_amount', 'usd_after_disc_cost_amount', 'usd_exchange_credit_amount', 'quantity', 'DIAMOND_SKU_1', 'DIAMOND_SKU_2', 'DIAMOND_SKU_4', 'DIAMOND_SKU_5', 'OFFER_ID', 'email_Address', 'email_address', 'first_name', 'last_name', 'derived_gender', 'age', 'OFFER_KEY_x', 'name', 'TARGET_GENDER', 'display_type', 'PRODUCT_KEY', 'OFFER_KEY_y', 'BILL_TO_ADDRESS_KEY', 'SHIP_TO_ADDRESS_KEY']
# Dropping unnecessary and repetitive columns due to joins
jewelry_df=jewelry_df.drop(['merch_category_rollup','merch_sub_category_rollup','MERCH_PRODUCT_CATEGORY','original_basket_id',
'order_type','site','basket_start_date','ship_date','ship_delivery_date','return_date',
'discount_name','local_currency','promo_discount_applied_flag',
'local_currency_before_disc_sale_amount','discount_promo_amount',
'local_currency_after_disc_sale_amount','usd_after_disc_cost_amount','usd_exchange_credit_amount',
'quantity','DIAMOND_SKU_1','DIAMOND_SKU_2','DIAMOND_SKU_4','DIAMOND_SKU_5','OFFER_ID',
'email_Address','email_address','first_name','last_name','OFFER_KEY_x','name','display_type',
'PRODUCT_KEY','OFFER_KEY_y'],axis=1)
jewelry_df.head()
| email_address_key | basket_id | submit_date | order_date | usd_after_disc_sale_amount | derived_gender | age | TARGET_GENDER | BILL_TO_ADDRESS_KEY | SHIP_TO_ADDRESS_KEY | |
|---|---|---|---|---|---|---|---|---|---|---|
| 41 | 31371922 | 45832107 | 2022-04-30 | 2022-04-30 | 1481.0 | F | NaN | NaN | 11855742 | 11855742 |
| 42 | 12192824 | 26092410 | 2022-04-11 | 2022-04-11 | 2625.0 | M | 41.0 | NaN | 7867151 | 11817631 |
| 43 | 22167265 | 15123721 | 2021-10-18 | 2021-10-18 | 3778.0 | M | NaN | NaN | 11384809 | 11384810 |
| 44 | 23403248 | 28743548 | 2020-11-30 | 2020-11-30 | 1124.0 | M | NaN | NaN | 10642691 | 10642692 |
| 45 | 1590647 | 30915560 | 2021-02-21 | 2021-02-21 | 1543.0 | F | NaN | NaN | 10901412 | 10901412 |
# To get info of a particular customer
check = jewelry_df.loc[(jewelry_df['email_address_key'] == 11625)]
check.sort_values(by = ['basket_id']).head(10)
| email_address_key | basket_id | submit_date | order_date | usd_after_disc_sale_amount | derived_gender | age | TARGET_GENDER | BILL_TO_ADDRESS_KEY | SHIP_TO_ADDRESS_KEY | |
|---|---|---|---|---|---|---|---|---|---|---|
| 96549 | 11625 | 12362718 | 2021-06-23 | 2021-06-23 | 5.00 | F | 59.0 | None | 10791650 | 11160185 |
| 628354 | 11625 | 17315472 | 2021-11-24 | 2021-11-24 | 359.82 | F | 59.0 | Female | 11181371 | 11445703 |
| 97317 | 11625 | 17315472 | 2021-11-24 | 2021-11-24 | 5.00 | F | 59.0 | None | 11181371 | 11445703 |
| 618094 | 11625 | 26449813 | 2021-07-06 | 2021-07-06 | 290.44 | F | 59.0 | Female | 11181371 | 11160185 |
| 586705 | 11625 | 26449813 | 2021-07-06 | 2021-07-06 | 98.25 | F | 59.0 | Female | 11181371 | 11160185 |
| 525602 | 11625 | 26449813 | 2021-07-06 | 2021-07-06 | 15.75 | F | 59.0 | Female | 11181371 | 11160185 |
| 631203 | 11625 | 26449813 | 2021-07-06 | 2021-07-06 | 19.69 | F | 59.0 | Female | 11181371 | 11160185 |
| 96846 | 11625 | 26449813 | 2021-07-06 | 2021-07-06 | 5.00 | F | 59.0 | None | 11181371 | 11160185 |
| 94755 | 11625 | 29295317 | 2021-01-05 | 2021-01-06 | 5.00 | F | 59.0 | None | 10791650 | 10791651 |
| 661967 | 11625 | 29295317 | 2021-01-05 | 2021-01-06 | 29.73 | F | 59.0 | Female | 10791650 | 10791651 |
Rules: an order is tagged as a Gifter purchase when the product's target gender is the opposite of the customer's gender, or when the genders match (or are unknown) but the order ships to an address different from the billing address; otherwise it is tagged as a Self-Gifter purchase.
# Taking the final Jewelry data for segmentation
jewelry_df.head()
| email_address_key | basket_id | submit_date | order_date | usd_after_disc_sale_amount | derived_gender | age | TARGET_GENDER | BILL_TO_ADDRESS_KEY | SHIP_TO_ADDRESS_KEY | |
|---|---|---|---|---|---|---|---|---|---|---|
| 41 | 31371922 | 45832107 | 2022-04-30 | 2022-04-30 | 1481.0 | F | NaN | NaN | 11855742 | 11855742 |
| 42 | 12192824 | 26092410 | 2022-04-11 | 2022-04-11 | 2625.0 | M | 41.0 | NaN | 7867151 | 11817631 |
| 43 | 22167265 | 15123721 | 2021-10-18 | 2021-10-18 | 3778.0 | M | NaN | NaN | 11384809 | 11384810 |
| 44 | 23403248 | 28743548 | 2020-11-30 | 2020-11-30 | 1124.0 | M | NaN | NaN | 10642691 | 10642692 |
| 45 | 1590647 | 30915560 | 2021-02-21 | 2021-02-21 | 1543.0 | F | NaN | NaN | 10901412 | 10901412 |
# Keeping only columns that we need
analytical_df = jewelry_df.rename(columns={'email_address_key':'customer','basket_id':'order','derived_gender':'customer_gender'
,'TARGET_GENDER':'target_gender','SHIP_TO_ADDRESS_KEY':'shipping_add',
'BILL_TO_ADDRESS_KEY':'billing_add'})[['customer','order','target_gender',
'customer_gender','shipping_add',
'billing_add']]
analytical_df.head()
| customer | order | target_gender | customer_gender | shipping_add | billing_add | |
|---|---|---|---|---|---|---|
| 41 | 31371922 | 45832107 | NaN | F | 11855742 | 11855742 |
| 42 | 12192824 | 26092410 | NaN | M | 11817631 | 7867151 |
| 43 | 22167265 | 15123721 | NaN | M | 11384810 | 11384809 |
| 44 | 23403248 | 28743548 | NaN | M | 10642692 | 10642691 |
| 45 | 1590647 | 30915560 | NaN | F | 10901412 | 10901412 |
# Checking unique values in target gender
analytical_df.target_gender.unique()
array([nan, 'Female', 'Male', 'None', 'Both'], dtype=object)
# Analytical column for target gender: 1 = Male, 2 = Female, 0 = Both/None/missing
analytical_df['target_gen'] = np.where(analytical_df['target_gender']=='Male', 1,
                              np.where(analytical_df['target_gender']=='Female', 2, 0))
analytical_df.head()
| customer | order | target_gender | customer_gender | shipping_add | billing_add | target_gen | |
|---|---|---|---|---|---|---|---|
| 41 | 31371922 | 45832107 | NaN | F | 11855742 | 11855742 | 0 |
| 42 | 12192824 | 26092410 | NaN | M | 11817631 | 7867151 | 0 |
| 43 | 22167265 | 15123721 | NaN | M | 11384810 | 11384809 | 0 |
| 44 | 23403248 | 28743548 | NaN | M | 10642692 | 10642691 | 0 |
| 45 | 1590647 | 30915560 | NaN | F | 10901412 | 10901412 | 0 |
# Checking unique values in target gender analytical column
analytical_df.target_gen.unique()
array([0, 2, 1])
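The same encoding can also be built with a dictionary `map` plus a fill value, which collapses 'Both', 'None', and NaN to 0 in one step; a self-contained sketch on toy values:

```python
import numpy as np
import pandas as pd

# Toy column mirroring the unique values seen in target_gender
target_gender = pd.Series(['Male', 'Female', 'Both', 'None', np.nan])

# Map the two explicit genders; 'Both', 'None' and NaN all fall through to 0
target_gen = target_gender.map({'Male': 1, 'Female': 2}).fillna(0).astype(int)
print(target_gen.tolist())  # [1, 2, 0, 0, 0]
```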
Target Gender Analytical Column Info: 1 = Male, 2 = Female, 0 = Both/None/missing.
# Checking unique in customer gender
analytical_df.customer_gender.unique()
array(['F', 'M', nan], dtype=object)
# Analytical data column for customer gender
analytical_df['customer_gen'] = np.where(analytical_df['customer_gender']=='M', 1,
np.where(analytical_df['customer_gender']=='F', 2, 0))
analytical_df.head()
| customer | order | target_gender | customer_gender | shipping_add | billing_add | target_gen | customer_gen | |
|---|---|---|---|---|---|---|---|---|
| 41 | 31371922 | 45832107 | NaN | F | 11855742 | 11855742 | 0 | 2 |
| 42 | 12192824 | 26092410 | NaN | M | 11817631 | 7867151 | 0 | 1 |
| 43 | 22167265 | 15123721 | NaN | M | 11384810 | 11384809 | 0 | 1 |
| 44 | 23403248 | 28743548 | NaN | M | 10642692 | 10642691 | 0 | 1 |
| 45 | 1590647 | 30915560 | NaN | F | 10901412 | 10901412 | 0 | 2 |
# Checking unique values in customer gender analytical column
analytical_df.customer_gen.unique()
array([2, 1, 0])
Customer Gender Analytical Column Info: 1 = Male (M), 2 = Female (F), 0 = missing.
# Creating a separate dataset for rows with null (filled with -1) shipping or billing addresses
nan_add = analytical_df[(analytical_df.shipping_add == -1) | (analytical_df.billing_add == -1)]
nan_add
| customer | order | target_gender | customer_gender | shipping_add | billing_add | target_gen | customer_gen | |
|---|---|---|---|---|---|---|---|---|
| 176598 | 23628820 | 80857940 | None | F | -1 | -1 | 0 | 2 |
| 176670 | 12139903 | 57607230 | None | M | -1 | -1 | 0 | 1 |
| 176720 | 17688936 | 92484802 | None | F | -1 | -1 | 0 | 2 |
| 176859 | 15718378 | 74604902 | None | M | -1 | -1 | 0 | 1 |
| 176880 | 23537079 | 21144903 | None | F | -1 | -1 | 0 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 191243 | 20544734 | 43859729 | None | M | -1 | -1 | 0 | 1 |
| 191305 | 24585987 | 67772519 | None | M | -1 | -1 | 0 | 1 |
| 191321 | 24677732 | 84289742 | None | F | -1 | -1 | 0 | 2 |
| 652119 | 31445010 | 21145863 | None | M | -1 | -1 | 0 | 1 |
| 667185 | 23981434 | 90094119 | None | M | -1 | -1 | 0 | 1 |
223 rows × 8 columns
We have 223 orders that have null Shipping/Billing Addresses.
# Analytical column for addresses: 0 when either address is missing
nan_add = nan_add.copy()  # explicit copy avoids SettingWithCopyWarning
nan_add['address'] = np.where((nan_add['shipping_add'] == -1) | (nan_add['billing_add'] == -1), 0, 1)
nan_add.head()
| customer | order | target_gender | customer_gender | shipping_add | billing_add | target_gen | customer_gen | address | |
|---|---|---|---|---|---|---|---|---|---|
| 176598 | 23628820 | 80857940 | None | F | -1 | -1 | 0 | 2 | 0 |
| 176670 | 12139903 | 57607230 | None | M | -1 | -1 | 0 | 1 | 0 |
| 176720 | 17688936 | 92484802 | None | F | -1 | -1 | 0 | 2 | 0 |
| 176859 | 15718378 | 74604902 | None | M | -1 | -1 | 0 | 1 | 0 |
| 176880 | 23537079 | 21144903 | None | F | -1 | -1 | 0 | 2 | 0 |
# Checking unique values of address column in null dataset
nan_add.address.unique()
array([0])
# Creating a separate dataset for rows where neither shipping nor billing address is null
# (& rather than |, so this is the exact complement of nan_add above)
non_nan_add = analytical_df[(analytical_df.shipping_add != -1) & (analytical_df.billing_add != -1)]
non_nan_add
| customer | order | target_gender | customer_gender | shipping_add | billing_add | target_gen | customer_gen | |
|---|---|---|---|---|---|---|---|---|
| 41 | 31371922 | 45832107 | NaN | F | 11855742 | 11855742 | 0 | 2 |
| 42 | 12192824 | 26092410 | NaN | M | 11817631 | 7867151 | 0 | 1 |
| 43 | 22167265 | 15123721 | NaN | M | 11384810 | 11384809 | 0 | 1 |
| 44 | 23403248 | 28743548 | NaN | M | 10642692 | 10642691 | 0 | 1 |
| 45 | 1590647 | 30915560 | NaN | F | 10901412 | 10901412 | 0 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 677949 | 3740508 | 48317500 | NaN | F | 10435269 | 10435269 | 0 | 2 |
| 677950 | 22859237 | 64797963 | NaN | M | 10399071 | 10399071 | 0 | 1 |
| 677951 | 23036258 | 22989040 | NaN | M | 10466130 | 10466130 | 0 | 1 |
| 677952 | 10708737 | 78485771 | NaN | M | 10893649 | 10893648 | 0 | 1 |
| 677954 | 18101292 | 51603645 | NaN | M | 10412311 | 10422395 | 0 | 1 |
545598 rows × 8 columns
# Analytical column for addresses: 1 = shipping equals billing, 2 = they differ
non_nan_add = non_nan_add.copy()  # explicit copy avoids SettingWithCopyWarning
non_nan_add['address'] = np.where(non_nan_add['shipping_add'] == non_nan_add['billing_add'], 1, 2)
non_nan_add.head()
| customer | order | target_gender | customer_gender | shipping_add | billing_add | target_gen | customer_gen | address | |
|---|---|---|---|---|---|---|---|---|---|
| 41 | 31371922 | 45832107 | NaN | F | 11855742 | 11855742 | 0 | 2 | 1 |
| 42 | 12192824 | 26092410 | NaN | M | 11817631 | 7867151 | 0 | 1 | 2 |
| 43 | 22167265 | 15123721 | NaN | M | 11384810 | 11384809 | 0 | 1 | 2 |
| 44 | 23403248 | 28743548 | NaN | M | 10642692 | 10642691 | 0 | 1 | 2 |
| 45 | 1590647 | 30915560 | NaN | F | 10901412 | 10901412 | 0 | 2 | 1 |
# Checking unique values of address column in non null dataset
non_nan_add.address.unique()
array([1, 2])
# Concatenating the two subsets back together
analytical_df = pd.concat([nan_add, non_nan_add])
analytical_df.sort_values('customer', ascending = True, inplace = True)
analytical_df
| customer | order | target_gender | customer_gender | shipping_add | billing_add | target_gen | customer_gen | address | |
|---|---|---|---|---|---|---|---|---|---|
| 611486 | 692 | 58224365 | Female | F | 11565604 | 5394034 | 2 | 2 | 2 |
| 16691 | 959 | 59011871 | NaN | F | 11822256 | 3839390 | 0 | 2 | 2 |
| 175629 | 959 | 59011871 | Female | F | 11822256 | 3839390 | 2 | 2 | 2 |
| 452063 | 1084 | 11849472 | Female | M | 11580711 | 11580710 | 2 | 1 | 2 |
| 598424 | 1359 | 78326444 | None | M | 11563268 | 11563268 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 128085 | 31490634 | 68606731 | None | F | 12031454 | 12031454 | 0 | 2 | 1 |
| 560725 | 31490635 | 93951222 | Female | M | 12031473 | 12031473 | 2 | 1 | 1 |
| 398927 | 31490636 | 56973875 | Female | F | 12031477 | 12031477 | 2 | 2 | 1 |
| 214912 | 31490637 | 70914185 | Female | M | 12031478 | 12031478 | 2 | 1 | 1 |
| 249992 | 31490637 | 70914185 | None | M | 12031478 | 12031478 | 0 | 1 | 1 |
545821 rows × 9 columns
# Checking unique values of address column of analytical dataset
analytical_df.address.unique()
array([2, 1, 0])
Address Analytical Column Info: 1 = shipping and billing addresses match, 2 = they differ, 0 = address missing.
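For reference, the two-subset split above can also be expressed as a single vectorized pass; a minimal sketch on toy values, using the same encoding:

```python
import numpy as np
import pandas as pd

# Toy shipping/billing keys; -1 marks a missing address, as in the notebook
ship = pd.Series([11855742, 11817631, -1])
bill = pd.Series([11855742, 7867151, -1])

# 0 = address missing, 1 = shipping equals billing, 2 = they differ
address = np.where((ship == -1) | (bill == -1), 0,
                   np.where(ship == bill, 1, 2))
print(address.tolist())  # [1, 2, 0]
```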
# Quick check to see if the analytical dataset has the right data
qc = analytical_df.loc[analytical_df['customer'] == 10106]  # column was renamed from email_address_key
qc
| customer | order | target_gender | customer_gender | shipping_add | billing_add | target_gen | customer_gen | address | |
|---|---|---|---|---|---|---|---|---|---|
| 262754 | 10106 | 78391343 | None | M | 11550719 | 3306065 | 0 | 1 | 2 |
| 441382 | 10106 | 78391343 | Female | M | 11550719 | 3306065 | 2 | 1 | 2 |
| 505052 | 10106 | 78391343 | Female | M | 11550719 | 3306065 | 2 | 1 | 2 |
# Dropping original gender and address columns as we have analytical dataset
analytical_df = analytical_df.drop(['target_gender','customer_gender','shipping_add','billing_add'], axis = 1)
analytical_df.head()
| customer | order | target_gen | customer_gen | address | |
|---|---|---|---|---|---|
| 611486 | 692 | 58224365 | 2 | 2 | 2 |
| 16691 | 959 | 59011871 | 0 | 2 | 2 |
| 175629 | 959 | 59011871 | 2 | 2 | 2 |
| 452063 | 1084 | 11849472 | 2 | 1 | 2 |
| 598424 | 1359 | 78326444 | 0 | 1 | 1 |
# Creating gifter flag using the rules
analytical_df['gifter'] = np.where((analytical_df['target_gen']==1) & (analytical_df['customer_gen']==2), 1,
np.where((analytical_df['target_gen']==2) & (analytical_df['customer_gen']==1), 1 ,
np.where((analytical_df['target_gen']==1) & (analytical_df['customer_gen']==1)
& (analytical_df['address']==2), 1,
np.where((analytical_df['target_gen']==2) & (analytical_df['customer_gen']==2)
& (analytical_df['address']==2), 1,
np.where((analytical_df['target_gen']==0) & (analytical_df['address']==2), 1,
np.where((analytical_df['customer_gen']==0) & (analytical_df['address']==2), 1, 0))))))
analytical_df.head()
| customer | order | target_gen | customer_gen | address | gifter | |
|---|---|---|---|---|---|---|
| 611486 | 692 | 58224365 | 2 | 2 | 2 | 1 |
| 16691 | 959 | 59011871 | 0 | 2 | 2 | 1 |
| 175629 | 959 | 59011871 | 2 | 2 | 2 | 1 |
| 452063 | 1084 | 11849472 | 2 | 1 | 2 | 1 |
| 598424 | 1359 | 78326444 | 0 | 1 | 1 | 0 |
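The six nested `np.where` branches above are compact but hard to audit. A hypothetical plain-Python restatement of the same rules (the function name is illustrative, not from the notebook):

```python
def is_gifter(target_gen, customer_gen, address):
    """Mirror of the gifter rules: 1 = Male, 2 = Female, 0 = unknown gender;
    address 1 = shipping equals billing, 2 = they differ, 0 = missing."""
    # Product targeted at the opposite gender of the buyer -> gift
    if (target_gen, customer_gen) in [(1, 2), (2, 1)]:
        return 1
    # Same known gender, but shipped somewhere other than the billing address -> gift
    if target_gen == customer_gen and target_gen in (1, 2) and address == 2:
        return 1
    # Unknown target or buyer gender, shipped to a different address -> gift
    if (target_gen == 0 or customer_gen == 0) and address == 2:
        return 1
    return 0

print(is_gifter(2, 2, 2))  # 1, matches customer 692 above
print(is_gifter(0, 1, 1))  # 0, matches customer 1359 above
```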
# Creating self-gifter flag
analytical_df['self_gifter'] = np.where((analytical_df['gifter'] == 1), 0,
np.where((analytical_df['gifter'] == 0) & (analytical_df['address']==0), 0, 1))
analytical_df.head()
| customer | order | target_gen | customer_gen | address | gifter | self_gifter | |
|---|---|---|---|---|---|---|---|
| 611486 | 692 | 58224365 | 2 | 2 | 2 | 1 | 0 |
| 16691 | 959 | 59011871 | 0 | 2 | 2 | 1 | 0 |
| 175629 | 959 | 59011871 | 2 | 2 | 2 | 1 | 0 |
| 452063 | 1084 | 11849472 | 2 | 1 | 2 | 1 | 0 |
| 598424 | 1359 | 78326444 | 0 | 1 | 1 | 0 | 1 |
At this point almost every order is tagged as either a Gifter or a Self-Gifter purchase; orders with no address information receive neither flag.
# Calculating sum of gifter and self_gifter flags per customer
cols = {'gifter':'sum_of_gifter', 'self_gifter':'sum_of_selfgifter'}
final_flag = analytical_df.groupby('customer').agg({'gifter':'sum','self_gifter':'sum'}).rename(columns=cols)
final_flag.sort_values(by=['customer'],ascending=True)
| sum_of_gifter | sum_of_selfgifter | |
|---|---|---|
| customer | ||
| 692 | 1 | 0 |
| 959 | 2 | 0 |
| 1084 | 1 | 0 |
| 1359 | 1 | 1 |
| 9775 | 1 | 0 |
| ... | ... | ... |
| 31490631 | 1 | 1 |
| 31490634 | 0 | 1 |
| 31490635 | 1 | 0 |
| 31490636 | 0 | 1 |
| 31490637 | 1 | 1 |
355655 rows × 2 columns
Summing the gifter and self-gifter flags per customer lets us classify each customer as a Gifter, a Self-Gifter, or both.
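On toy data, the per-customer flag sums behave like this (values are illustrative):

```python
import pandas as pd

# Illustrative orders: customer 1 gifts once and self-gifts once,
# customer 2 only gifts, customer 3 only self-gifts
toy = pd.DataFrame({'customer':    [1, 1, 2, 3, 3],
                    'gifter':      [1, 0, 1, 0, 0],
                    'self_gifter': [0, 1, 0, 1, 1]})
sums = toy.groupby('customer')[['gifter', 'self_gifter']].sum()
print(sums.loc[1].tolist())  # [1, 1] -> tagged 'Both'
print(sums.loc[2].tolist())  # [1, 0] -> tagged 'Gifter'
```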
# Creating final flag
final_flag['final_flag'] = np.where((final_flag['sum_of_gifter']!=0) & (final_flag['sum_of_selfgifter']==0), 1,
np.where((final_flag['sum_of_gifter']==0) & (final_flag['sum_of_selfgifter']!=0), 2 ,
np.where((final_flag['sum_of_gifter']!=0) & (final_flag['sum_of_selfgifter']!=0), 3, 0)))
final_flag.reset_index()
| customer | sum_of_gifter | sum_of_selfgifter | final_flag | |
|---|---|---|---|---|
| 0 | 692 | 1 | 0 | 1 |
| 1 | 959 | 2 | 0 | 1 |
| 2 | 1084 | 1 | 0 | 1 |
| 3 | 1359 | 1 | 1 | 3 |
| 4 | 9775 | 1 | 0 | 1 |
| ... | ... | ... | ... | ... |
| 355650 | 31490631 | 1 | 1 | 3 |
| 355651 | 31490634 | 0 | 1 | 2 |
| 355652 | 31490635 | 1 | 0 | 1 |
| 355653 | 31490636 | 0 | 1 | 2 |
| 355654 | 31490637 | 1 | 1 | 3 |
355655 rows × 4 columns
# Checking unique values of final flag
final_flag.final_flag.unique()
array([1, 3, 2, 0])
Final Flag Column Info: 1 = Gifter, 2 = Self-Gifter, 3 = Both, 0 = Not Known.
# Creating final output
final_out = final_flag.drop(['sum_of_gifter','sum_of_selfgifter'], axis = 1)
final_out = final_out.rename_axis('customer').reset_index()
final_out.head()
| customer | final_flag | |
|---|---|---|
| 0 | 692 | 1 |
| 1 | 959 | 1 |
| 2 | 1084 | 1 |
| 3 | 1359 | 3 |
| 4 | 9775 | 1 |
# Changing the flags to names for clarity
final_out['final_flag'] = np.where(final_out['final_flag']==1, 'Gifter',
np.where(final_out['final_flag']==2, 'Self-Gifter',
np.where(final_out['final_flag']==3, 'Both', 'Not Known')))
final_out
| customer | final_flag | |
|---|---|---|
| 0 | 692 | Gifter |
| 1 | 959 | Gifter |
| 2 | 1084 | Gifter |
| 3 | 1359 | Both |
| 4 | 9775 | Gifter |
| ... | ... | ... |
| 355650 | 31490631 | Both |
| 355651 | 31490634 | Self-Gifter |
| 355652 | 31490635 | Gifter |
| 355653 | 31490636 | Self-Gifter |
| 355654 | 31490637 | Both |
355655 rows × 2 columns
# Pie chart representation of the segmentation
gifts_segment = final_out.groupby('final_flag')['customer'].nunique()  # aggregating the groupby key itself can raise in newer pandas
plt.pie(gifts_segment, labels = gifts_segment.index, radius = 1.5, textprops = {"fontsize":15},
        autopct = "%1.1f%%", wedgeprops = {"edgecolor":"black",'linewidth':0.5})
plt.show()
# Grouping final flag
final_out['final_flag'].value_counts()
Gifter 228764 Self-Gifter 95858 Both 30919 Not Known 114 Name: final_flag, dtype: int64
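The percentage split behind the pie chart can be reproduced directly from these counts:

```python
import pandas as pd

# Counts taken from the value_counts output above
counts = pd.Series({'Gifter': 228764, 'Self-Gifter': 95858,
                    'Both': 30919, 'Not Known': 114})
shares = (counts / counts.sum() * 100).round(1)
print(shares.to_dict())  # {'Gifter': 64.3, 'Self-Gifter': 27.0, 'Both': 8.7, 'Not Known': 0.0}
```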
Gifting Flag Inference: about 64% of customers are Gifters, 27% are Self-Gifters, and roughly 9% buy both for themselves and for others; only 114 customers (0.03%) could not be classified.
# Only keeping the Engagement category data in non-returns final merged data
engagement_df = final_merged.loc[(final_merged['merch_category_rollup'] == 'Engagement') |
(final_merged['merch_category_rollup'] == 'Loose Diamond')]
engagement_df
| merch_category_rollup | merch_sub_category_rollup | MERCH_PRODUCT_CATEGORY | email_address_key | basket_id | original_basket_id | order_type | site | basket_start_date | submit_date | ... | derived_gender | age | OFFER_KEY_x | name | TARGET_GENDER | display_type | PRODUCT_KEY | OFFER_KEY_y | BILL_TO_ADDRESS_KEY | SHIP_TO_ADDRESS_KEY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | Engagement | Engagement | BYO Semi-Mounts | 19696924 | 23989389 | 23989389 | SPECIAL ORDER | BN | 06-06-2021 | 2021-06-07 | ... | M | 31.0 | NaN | NaN | NaN | NaN | 3565789.0 | 76212.0 | 11129079 | 11129080 |
| 3 | Engagement | Engagement | BYO Semi-Mounts | 24120678 | 12758238 | 12758238 | STANDARD | BN | 20-09-2021 | 2021-09-20 | ... | M | 37.0 | NaN | NaN | NaN | NaN | 3601472.0 | 76207.0 | 11329644 | 11178488 |
| 5 | Engagement | Engagement | BYO Semi-Mounts | 24092029 | 18812838 | 18812838 | STANDARD | BN | 11-07-2021 | 2021-08-18 | ... | M | 34.0 | NaN | NaN | NaN | NaN | 2604603.0 | 60452.0 | 11264941 | 11264941 |
| 6 | Engagement | Engagement | BYO Semi-Mounts | 24079756 | 47744213 | 47744213 | STANDARD | BN | 09-06-2021 | 2021-06-09 | ... | M | NaN | NaN | NaN | NaN | NaN | 2888321.0 | 63046.0 | 11137487 | 11135306 |
| 7 | Engagement | Engagement | BYO Semi-Mounts | 11025516 | 83308313 | 83308313 | SPECIAL ORDER | BN | 24-06-2021 | 2021-06-25 | ... | M | NaN | NaN | NaN | NaN | NaN | 3577155.0 | 76455.0 | 11163959 | 11163960 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 677943 | Engagement | Engagement | BYO 3-Stone Metal | 24281009 | 65117329 | 65117329 | STANDARD | BN | 29-09-2021 | 2021-10-08 | ... | M | NaN | NaN | NaN | NaN | NaN | 2737184.0 | 53140.0 | 11366100 | 11366101 |
| 677944 | Engagement | Engagement | BYO Semi-Mounts | 15321364 | 71624005 | 71624005 | SPECIAL ORDER | BN | 23-06-2022 | 2022-06-20 | ... | F | 1968.0 | NaN | NaN | NaN | NaN | 3855345.0 | 76309.0 | 9762218 | 11955334 |
| 677945 | Engagement | Engagement | BYO 3-Stone Metal | 24074257 | 14960454 | 14960454 | STANDARD | BN | 24-06-2021 | 2021-06-07 | ... | M | 23.0 | NaN | NaN | NaN | NaN | 2096005.0 | 49871.0 | 11158433 | 11158434 |
| 677948 | Engagement | Engagement | BYO 3-Stone Metal | 24520746 | 82913867 | 82913867 | STANDARD | BN | 11-11-2021 | 2021-11-11 | ... | F | NaN | NaN | NaN | NaN | NaN | 3253923.0 | 53140.0 | 11442830 | 11442831 |
| 677953 | Engagement | Engagement | BYO 3-Stone Metal | 19190442 | 64237203 | 64237203 | STANDARD | BN | 08-10-2020 | 2020-10-08 | ... | M | 60.0 | NaN | NaN | NaN | NaN | 2098637.0 | 49876.0 | 10504692 | 10504692 |
79322 rows × 43 columns
The Engagement data contains 79,322 order lines.
# Checking that the remaining rows belong only to those 2 categories
engagement_df.merch_category_rollup.unique()
array(['Engagement', 'Loose Diamond'], dtype=object)
# Columns in the engagement data
list(engagement_df.columns)
['merch_category_rollup', 'merch_sub_category_rollup', 'MERCH_PRODUCT_CATEGORY', 'email_address_key', 'basket_id', 'original_basket_id', 'order_type', 'site', 'basket_start_date', 'submit_date', 'order_date', 'ship_date', 'ship_delivery_date', 'return_date', 'discount_name', 'local_currency', 'promo_discount_applied_flag', 'local_currency_before_disc_sale_amount', 'discount_promo_amount', 'local_currency_after_disc_sale_amount', 'usd_after_disc_sale_amount', 'usd_after_disc_cost_amount', 'usd_exchange_credit_amount', 'quantity', 'DIAMOND_SKU_1', 'DIAMOND_SKU_2', 'DIAMOND_SKU_4', 'DIAMOND_SKU_5', 'OFFER_ID', 'email_Address', 'email_address', 'first_name', 'last_name', 'derived_gender', 'age', 'OFFER_KEY_x', 'name', 'TARGET_GENDER', 'display_type', 'PRODUCT_KEY', 'OFFER_KEY_y', 'BILL_TO_ADDRESS_KEY', 'SHIP_TO_ADDRESS_KEY']
# Dropping unnecessary and repetitive columns due to joins
engagement_df=engagement_df.drop(['merch_category_rollup','merch_sub_category_rollup','MERCH_PRODUCT_CATEGORY',
'original_basket_id','order_type','site','basket_start_date','ship_date','ship_delivery_date',
'return_date','discount_name','local_currency','promo_discount_applied_flag',
'local_currency_before_disc_sale_amount','discount_promo_amount',
'local_currency_after_disc_sale_amount','usd_after_disc_cost_amount',
'usd_exchange_credit_amount','quantity','DIAMOND_SKU_1','DIAMOND_SKU_2','DIAMOND_SKU_4',
'DIAMOND_SKU_5','OFFER_ID','email_Address','email_address','first_name','last_name',
'OFFER_KEY_x','name','display_type','PRODUCT_KEY','OFFER_KEY_y'],axis = 1)
engagement_df
| email_address_key | basket_id | submit_date | order_date | usd_after_disc_sale_amount | derived_gender | age | TARGET_GENDER | BILL_TO_ADDRESS_KEY | SHIP_TO_ADDRESS_KEY | |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 19696924 | 23989389 | 2021-06-07 | 2021-06-07 | 5621.00 | M | 31.0 | NaN | 11129079 | 11129080 |
| 3 | 24120678 | 12758238 | 2021-09-20 | 2021-09-20 | 2759.00 | M | 37.0 | NaN | 11329644 | 11178488 |
| 5 | 24092029 | 18812838 | 2021-08-18 | 2021-08-18 | 3178.00 | M | 34.0 | NaN | 11264941 | 11264941 |
| 6 | 24079756 | 47744213 | 2021-06-09 | 2021-06-13 | 1668.00 | M | NaN | NaN | 11137487 | 11135306 |
| 7 | 11025516 | 83308313 | 2021-06-25 | 2021-06-25 | 3749.00 | M | NaN | NaN | 11163959 | 11163960 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 677943 | 24281009 | 65117329 | 2021-10-08 | 2021-10-08 | 25481.50 | M | NaN | NaN | 11366100 | 11366101 |
| 677944 | 15321364 | 71624005 | 2022-06-20 | 2022-06-28 | 13973.43 | F | 1968.0 | NaN | 9762218 | 11955334 |
| 677945 | 24074257 | 14960454 | 2021-06-07 | 2021-06-27 | 3758.00 | M | 23.0 | NaN | 11158433 | 11158434 |
| 677948 | 24520746 | 82913867 | 2021-11-11 | 2021-11-16 | 17564.52 | F | NaN | NaN | 11442830 | 11442831 |
| 677953 | 19190442 | 64237203 | 2020-10-08 | 2020-10-09 | 6235.00 | M | 60.0 | NaN | 10504692 | 10504692 |
79322 rows × 10 columns
# Renaming the email address key column to Customer
engagement_df = engagement_df.rename(columns = {'email_address_key' : 'Customer'})
# Taking only columns that we need
engagement_df = engagement_df[['Customer', 'submit_date', 'basket_id', 'usd_after_disc_sale_amount']]
# Sorting in descending order of Submit Date (assigning the result so the sort persists)
engagement_df = engagement_df.sort_values(by = ['submit_date'], ascending = False)
engagement_df
| Customer | submit_date | basket_id | usd_after_disc_sale_amount | |
|---|---|---|---|---|
| 443889 | 31490238 | 2022-08-03 | 34373726 | 4232.50 |
| 583219 | 31480521 | 2022-08-03 | 88201332 | 24945.62 |
| 671734 | 31486262 | 2022-08-03 | 84878846 | 862.74 |
| 528026 | 31490243 | 2022-08-03 | 72212082 | 1238.50 |
| 615416 | 31490189 | 2022-08-03 | 69803977 | 6997.35 |
| ... | ... | ... | ... | ... |
| 3209 | 22838239 | 2020-08-01 | 79139376 | 10339.00 |
| 334138 | 22838107 | 2020-08-01 | 72716443 | 3042.00 |
| 28613 | 22834473 | 2020-08-01 | 88117283 | 816.00 |
| 11438 | 22838240 | 2020-08-01 | 47732552 | 3997.00 |
| 626897 | 22809875 | 2020-08-01 | 28246872 | 2844.99 |
79322 rows × 4 columns
# Counting unique values in each column
engagement_df.nunique()
Customer 70432 submit_date 733 basket_id 75453 usd_after_disc_sale_amount 50418 dtype: int64
There are 70,432 unique customers in the Engagement category.
# Grouping at customer level ('max' gives the most recent submit_date regardless of sort order)
columns = {'submit_date':'Recency_Date','basket_id':'Frequency', 'usd_after_disc_sale_amount':'Sales_Amount'}
engagement_rfm = engagement_df.groupby('Customer').agg({'submit_date':'max','basket_id':'nunique',
                            'usd_after_disc_sale_amount':'sum'}).rename(columns=columns).round(decimals=2)
engagement_rfm.reset_index()
engagement_rfm.reset_index()
| Customer | Recency_Date | Frequency | Sales_Amount | |
|---|---|---|---|---|
| 0 | 959 | 2022-04-12 | 1 | 2685.19 |
| 1 | 10261 | 2020-10-17 | 3 | 9604.79 |
| 2 | 10728 | 2022-05-16 | 1 | 4408.00 |
| 3 | 11625 | 2021-05-31 | 1 | 83.18 |
| 4 | 12177 | 2021-02-25 | 1 | 10010.69 |
| ... | ... | ... | ... | ... |
| 70427 | 31490557 | 2022-08-03 | 1 | 1181.50 |
| 70428 | 31490598 | 2022-08-03 | 1 | 3111.84 |
| 70429 | 31490625 | 2022-08-03 | 1 | 7437.50 |
| 70430 | 31490630 | 2022-08-03 | 1 | 2181.57 |
| 70431 | 31490633 | 2022-08-03 | 1 | 3490.50 |
70432 rows × 4 columns
# Getting most recent ordered date
reference_date = engagement_rfm['Recency_Date'].max() + pd.DateOffset(days=1)
reference_date
Timestamp('2022-08-04 00:00:00')
The reference date is the day after the most recent date in the Recency_Date column; each customer's recency in days is computed relative to it.
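On a couple of toy dates, the recency calculation works out as follows (2022-04-12 matches customer 959's Recency_Date shown below):

```python
import pandas as pd

reference_date = pd.Timestamp('2022-08-04')  # day after the latest order date
last_orders = pd.Series(pd.to_datetime(['2022-08-03', '2022-04-12']))
recency = (reference_date - last_orders).dt.days
print(recency.tolist())  # [1, 114]
```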
# Creating Recency column in days (vectorized subtraction instead of a row-wise apply)
engagement_rfm['Recency'] = (reference_date - engagement_rfm['Recency_Date']).dt.days
engagement_rfm.drop('Recency_Date', inplace = True, axis = 1)
engagement_rfm = engagement_rfm[['Recency', 'Frequency', 'Sales_Amount']]
engagement_rfm.head()
| Recency | Frequency | Sales_Amount | |
|---|---|---|---|
| Customer | |||
| 959 | 114 | 1 | 2685.19 |
| 10261 | 656 | 3 | 9604.79 |
| 10728 | 80 | 1 | 4408.00 |
| 11625 | 430 | 1 | 83.18 |
| 12177 | 525 | 1 | 10010.69 |
# Creating Monetary column
engagement_rfm['Monetary'] = (engagement_rfm['Sales_Amount']/engagement_rfm['Frequency']).round(decimals=2)
# Dropping Sales Amount column
engagement_rfm = engagement_rfm.drop('Sales_Amount', axis = 1)
engagement_rfm.reset_index(inplace=True)
engagement_rfm.head()
| Customer | Recency | Frequency | Monetary | |
|---|---|---|---|---|
| 0 | 959 | 114 | 1 | 2685.19 |
| 1 | 10261 | 656 | 3 | 3201.60 |
| 2 | 10728 | 80 | 1 | 4408.00 |
| 3 | 11625 | 430 | 1 | 83.18 |
| 4 | 12177 | 525 | 1 | 10010.69 |
We now have the Recency, Frequency and Monetary values for all 70,432 customers.
# Checking the minimum value of Monetary Column
engagement_rfm['Monetary'].min()
0.0
# Removing customers with a 0 monetary value after checking their purchase info
engagement_rfm = engagement_rfm[engagement_rfm.Monetary != 0]
# Checking if 0 Monetary values are removed or not
engagement_rfm['Monetary'].min()
49.0
# Checking Recency Range
a, b = engagement_rfm.Recency.min(), engagement_rfm.Recency.max()
a, b
(1, 733)
# Checking Frequency Range
c, d = engagement_rfm.Frequency.min(), engagement_rfm.Frequency.max()
c, d
(1, 1835)
# Checking Monetary Range
e, f = engagement_rfm.Monetary.min(), engagement_rfm.Monetary.max()
e, f
(49.0, 810965.47)
# Creating bins to check distribution
rec_range = [0, 90, 180, 270, 360, 450, 540, 630, 740]
rec_labels = ['<3','3-6','6-9','9-12','12-15','15-18','18-21','21+']
freq_range = [0, 1, 2, 5, 20, 50, 1835]
freq_labels = ['1','2','3-5','6-20', '21-50','>50']
money_range = [0, 1000, 3000, 5000, 10000, 20000, 50000, 70000, 100000, 900000]
money_labels = ['<1000','1-3k','3-5k', '5-10k','10-20k','20-50k','50-70k','70k-1lc', '>1lc']  # '1lc' = 1 lakh (100,000)
engagement_rfm['R_Labels'] = pd.cut(engagement_rfm['Recency'], bins = rec_range, labels = rec_labels)
engagement_rfm['F_Labels'] = pd.cut(engagement_rfm['Frequency'], bins = freq_range, labels = freq_labels)
engagement_rfm['M_Labels'] = pd.cut(engagement_rfm['Monetary'], bins = money_range, labels = money_labels)
engagement_rfm.head()
| Customer | Recency | Frequency | Monetary | R_Labels | F_Labels | M_Labels | |
|---|---|---|---|---|---|---|---|
| 0 | 959 | 114 | 1 | 2685.19 | 3-6 | 1 | 1-3k |
| 1 | 10261 | 656 | 3 | 3201.60 | 21+ | 3-5 | 3-5k |
| 2 | 10728 | 80 | 1 | 4408.00 | <3 | 1 | 3-5k |
| 3 | 11625 | 430 | 1 | 83.18 | 12-15 | 1 | <1000 |
| 4 | 12177 | 525 | 1 | 10010.69 | 15-18 | 1 | 10-20k |
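Note that `pd.cut` bins are right-inclusive by default, so with `rec_range` starting `[0, 90, 180, ...]` a recency of exactly 90 days still lands in the `<3` bucket; a quick check:

```python
import pandas as pd

rec_range = [0, 90, 180]
rec_labels = ['<3', '3-6']
# Default right=True: intervals are (0, 90] and (90, 180]
binned = pd.cut(pd.Series([90, 91]), bins=rec_range, labels=rec_labels)
print(binned.tolist())  # ['<3', '3-6']
```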
# Checking the Distribution of the Bins
fig = plt.figure(figsize=(14,14))
# Recency Plot
ax1 = plt.subplot2grid((2,3),(0,0))
ax = engagement_rfm['R_Labels'].value_counts().plot(kind='bar', width = 0.4, edgecolor = None)
plt.title('Recency Bins Distribution', y = 1)
for p in ax.patches:
width = p.get_width()
height = p.get_height()
x, y = p.get_xy()
ax.annotate(f'{height}', (x + width/2, y + height*1.02), ha='center')
# Frequency Plot
ax1 = plt.subplot2grid((2,3), (0,1))
ax = engagement_rfm['F_Labels'].value_counts().plot(kind='bar', width = 0.4, edgecolor = None)
plt.title('Frequency Bins Distribution', y = 1)
for p in ax.patches:
width = p.get_width()
height = p.get_height()
x, y = p.get_xy()
ax.annotate(f'{height}', (x + width/2, y + height*1.02), ha='center')
# Monetary Plot
ax1 = plt.subplot2grid((2,3), (0,2))
ax = engagement_rfm['M_Labels'].value_counts().plot(kind='bar', width = 0.4, edgecolor = None)
plt.title('Monetary Bins Distribution', y = 1)
for p in ax.patches:
width = p.get_width()
height = p.get_height()
x, y = p.get_xy()
ax.annotate(f'{height}', (x + width/2, y + height*1.02), ha='center')
fig.tight_layout()
plt.show()
# Getting the number of customers over different Bins Distribution
a = ((engagement_rfm['R_Labels'].value_counts(normalize = True) * 100).round(decimals=2)).to_dict()
b = ((engagement_rfm['F_Labels'].value_counts(normalize = True) * 100).round(decimals=2)).to_dict()
c = ((engagement_rfm['M_Labels'].value_counts(normalize = True) * 100).round(decimals=2)).to_dict()
a, b, c
({'21+': 16.5,
'18-21': 13.63,
'15-18': 12.84,
'6-9': 12.83,
'9-12': 11.61,
'12-15': 11.56,
'<3': 10.71,
'3-6': 10.33},
{'1': 97.94,
'2': 1.58,
'3-5': 0.36,
'6-20': 0.11,
'21-50': 0.01,
'>50': 0.01},
{'1-3k': 26.76,
'5-10k': 25.38,
'3-5k': 20.9,
'10-20k': 13.6,
'<1000': 7.18,
'20-50k': 5.29,
'50-70k': 0.46,
'70k-1lc': 0.22,
'>1lc': 0.21})
Recency: customers are spread fairly evenly across the recency bins (roughly 10-17% in each).
Frequency: about 98% of Engagement customers made only a single purchase.
Monetary: roughly 73% of customers spent between $1k and $10k.
# Dropping Label columns of R, F and M as we no longer need them
engagement_rfm = engagement_rfm.drop(['R_Labels','F_Labels','M_Labels'],axis = 1)
engagement_rfm.head()
| Customer | Recency | Frequency | Monetary | |
|---|---|---|---|---|
| 0 | 959 | 114 | 1 | 2685.19 |
| 1 | 10261 | 656 | 3 | 3201.60 |
| 2 | 10728 | 80 | 1 | 4408.00 |
| 3 | 11625 | 430 | 1 | 83.18 |
| 4 | 12177 | 525 | 1 | 10010.69 |
# Function to create R value
def Rvalue(x):
if x <= 180:
return 4
elif x <= 360:
return 3
elif x <= 540:
return 2
else:
return 1
# Function to create F value
def Fvalue(x):
if x <= 1:
return 1
else:
return 2
# Function to create M value
def Mvalue(x):
if x <= 2500:
return 1
elif x <= 5000:
return 2
elif x <= 10000:
return 3
elif x <= 25000:
return 4
else:
return 5
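The three if/elif functions above can also be expressed with `pd.cut`, using right-closed bins that match the same thresholds. A minimal sketch on toy values (not the notebook's actual data):

```python
import pandas as pd

# Same R scoring as Rvalue(): <=180 -> 4, <=360 -> 3, <=540 -> 2, else 1
recency = pd.Series([90, 200, 400, 700])
r_score = pd.cut(recency, bins=[0, 180, 360, 540, float('inf')],
                 labels=[4, 3, 2, 1]).astype(int)

# Same M scoring as Mvalue(): thresholds at 2500, 5000, 10000, 25000
monetary = pd.Series([2000, 4000, 9000, 20000, 30000])
m_score = pd.cut(monetary, bins=[0, 2500, 5000, 10000, 25000, float('inf')],
                 labels=[1, 2, 3, 4, 5]).astype(int)
```

The bin edges here mirror the functions above; either approach yields the same scores, but `pd.cut` keeps the thresholds in one place.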
# Adding R, F and M segment value columns to the existing dataset
engagement_rfm['R'] = engagement_rfm['Recency'].apply(Rvalue)
engagement_rfm['F'] = engagement_rfm['Frequency'].apply(Fvalue)
engagement_rfm['M'] = engagement_rfm['Monetary'].apply(Mvalue)
engagement_rfm.head()
| Customer | Recency | Frequency | Monetary | R | F | M | |
|---|---|---|---|---|---|---|---|
| 0 | 959 | 114 | 1 | 2685.19 | 4 | 1 | 2 |
| 1 | 10261 | 656 | 3 | 3201.60 | 1 | 2 | 2 |
| 2 | 10728 | 80 | 1 | 4408.00 | 4 | 1 | 2 |
| 3 | 11625 | 430 | 1 | 83.18 | 2 | 1 | 1 |
| 4 | 12177 | 525 | 1 | 10010.69 | 2 | 1 | 4 |
# Checking the Distribution of the bins
fig = plt.figure(figsize=(14,14))
# Recency Plot
ax1 = plt.subplot2grid((2,3),(0,0))
ax = engagement_rfm['R'].value_counts().plot(kind='bar', width = 0.4, edgecolor = None)
plt.title('Recency Distribution', y = 1)
for p in ax.patches:
width = p.get_width()
height = p.get_height()
x, y = p.get_xy()
ax.annotate(f'{height}', (x + width/2, y + height*1.02), ha='center')
# Frequency Plot
ax1 = plt.subplot2grid((2,3), (0,1))
ax = engagement_rfm['F'].value_counts().plot(kind='bar', width = 0.4, edgecolor = None)
plt.title('Frequency Distribution', y = 1)
for p in ax.patches:
width = p.get_width()
height = p.get_height()
x, y = p.get_xy()
ax.annotate(f'{height}', (x + width/2, y + height*1.02), ha='center')
# Monetary Plot
ax1 = plt.subplot2grid((2,3), (0,2))
ax = engagement_rfm['M'].value_counts().plot(kind='bar', width = 0.4, edgecolor = None)
plt.title('Monetary Distribution', y = 1)
for p in ax.patches:
width = p.get_width()
height = p.get_height()
x, y = p.get_xy()
ax.annotate(f'{height}', (x + width/2, y + height*1.02), ha='center')
fig.tight_layout()
plt.show()
# Getting the number of customers over different final Bins Distribution
a = ((engagement_rfm['R'].value_counts(normalize = True) * 100).round(decimals=2)).to_dict()
b = ((engagement_rfm['F'].value_counts(normalize = True) * 100).round(decimals=2)).to_dict()
c = ((engagement_rfm['M'].value_counts(normalize = True) * 100).round(decimals=2)).to_dict()
sorted(a.items()), sorted(b.items()), sorted(c.items())
([(1, 30.12), (2, 24.4), (3, 24.44), (4, 21.04)], [(1, 97.94), (2, 2.06)], [(1, 27.57), (2, 27.27), (3, 25.38), (4, 15.9), (5, 3.88)])
# Concatenating the 3 values as the final RFM column
engagement_rfm['RFM'] = engagement_rfm.R.map(str) + engagement_rfm.F.map(str) + engagement_rfm.M.map(str)
engagement_rfm
| Customer | Recency | Frequency | Monetary | R | F | M | RFM | |
|---|---|---|---|---|---|---|---|---|
| 0 | 959 | 114 | 1 | 2685.19 | 4 | 1 | 2 | 412 |
| 1 | 10261 | 656 | 3 | 3201.60 | 1 | 2 | 2 | 122 |
| 2 | 10728 | 80 | 1 | 4408.00 | 4 | 1 | 2 | 412 |
| 3 | 11625 | 430 | 1 | 83.18 | 2 | 1 | 1 | 211 |
| 4 | 12177 | 525 | 1 | 10010.69 | 2 | 1 | 4 | 214 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 70427 | 31490557 | 1 | 1 | 1181.50 | 4 | 1 | 1 | 411 |
| 70428 | 31490598 | 1 | 1 | 3111.84 | 4 | 1 | 2 | 412 |
| 70429 | 31490625 | 1 | 1 | 7437.50 | 4 | 1 | 3 | 413 |
| 70430 | 31490630 | 1 | 1 | 2181.57 | 4 | 1 | 1 | 411 |
| 70431 | 31490633 | 1 | 1 | 3490.50 | 4 | 1 | 2 | 412 |
70430 rows × 8 columns
# Unique combinations of RFM column
eng = engagement_rfm.RFM.unique()
print(sorted(eng))
['111', '112', '113', '114', '115', '121', '122', '123', '124', '125', '211', '212', '213', '214', '215', '221', '222', '223', '224', '225', '311', '312', '313', '314', '315', '321', '322', '323', '324', '325', '411', '412', '413', '414', '415', '421', '422', '423', '424', '425']
# Total Distinct Customer categories we got through RFM
engagement_rfm.RFM.nunique()
40
engagement_rfm['Segment'] = np.where(engagement_rfm['RFM'].isin(['425','415','325','315','225','215','125','115']),
'VIP',
np.where(engagement_rfm['RFM'].isin(['424','423','324','323']), 'Best',
np.where(engagement_rfm['RFM'].isin(['414','413','314','313','224','223','214','213','124','123',
'114','113']), 'Potential',
np.where(engagement_rfm['RFM'].isin(['422','421','322','321']), 'Promising',
np.where(engagement_rfm['RFM'].isin(['412','411','312','311']), 'Rookies',
np.where(engagement_rfm['RFM'].isin(['222','221','212','211']), 'Winback',
np.where(engagement_rfm['RFM'].isin(['122','121','112','111']),
'Churned','Others')))))))
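The nested `np.where` calls can alternatively be flattened into a single dict lookup that maps each RFM code to its segment, with `'Others'` as the default. A sketch with only a few illustrative codes (not the full mapping used above):

```python
import pandas as pd

# Partial RFM-code -> segment mapping, for illustration only
segment_map = {code: 'VIP' for code in ['425', '415', '325', '315']}
segment_map.update({code: 'Rookies' for code in ['412', '411', '312', '311']})
segment_map.update({code: 'Churned' for code in ['122', '121', '112', '111']})

rfm = pd.Series(['412', '122', '999'])
# Unmapped codes become NaN, then default to 'Others'
segment = rfm.map(segment_map).fillna('Others')
```

This reads the same top to bottom as the nested version but avoids counting parentheses when a segment definition changes.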
engagement_rfm
| Customer | Recency | Frequency | Monetary | R | F | M | RFM | Segment | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 959 | 114 | 1 | 2685.19 | 4 | 1 | 2 | 412 | Rookies |
| 1 | 10261 | 656 | 3 | 3201.60 | 1 | 2 | 2 | 122 | Churned |
| 2 | 10728 | 80 | 1 | 4408.00 | 4 | 1 | 2 | 412 | Rookies |
| 3 | 11625 | 430 | 1 | 83.18 | 2 | 1 | 1 | 211 | Winback |
| 4 | 12177 | 525 | 1 | 10010.69 | 2 | 1 | 4 | 214 | Potential |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 70427 | 31490557 | 1 | 1 | 1181.50 | 4 | 1 | 1 | 411 | Rookies |
| 70428 | 31490598 | 1 | 1 | 3111.84 | 4 | 1 | 2 | 412 | Rookies |
| 70429 | 31490625 | 1 | 1 | 7437.50 | 4 | 1 | 3 | 413 | Potential |
| 70430 | 31490630 | 1 | 1 | 2181.57 | 4 | 1 | 1 | 411 | Rookies |
| 70431 | 31490633 | 1 | 1 | 3490.50 | 4 | 1 | 2 | 412 | Rookies |
70430 rows × 9 columns
# Plotting the Segmentation
eng_seg = engagement_rfm.groupby(['Segment'])['Customer'].count()
ax = eng_seg.plot(kind='bar', figsize=(6,6), width = 0.5, edgecolor=None)
for p in ax.patches:
width = p.get_width()
height = p.get_height()
x, y = p.get_xy()
ax.annotate(f'{height}', (x + width/2, y + height*1.02), ha='center')
# Percentage of customers among different segments
(engagement_rfm['Segment'].value_counts(normalize=True) * 100).round(decimals=2)
Potential    41.03
Rookies      23.26
Churned      17.67
Winback      13.24
VIP           3.88
Promising     0.66
Best          0.25
Name: Segment, dtype: float64
We divided the customers into seven segments, with Potential customers forming the largest group at about 41% of the customer base.
# Getting Jewelry data Info
jewelry_df
| email_address_key | basket_id | submit_date | order_date | usd_after_disc_sale_amount | derived_gender | age | TARGET_GENDER | BILL_TO_ADDRESS_KEY | SHIP_TO_ADDRESS_KEY | |
|---|---|---|---|---|---|---|---|---|---|---|
| 41 | 31371922 | 45832107 | 2022-04-30 | 2022-04-30 | 1481.0 | F | NaN | NaN | 11855742 | 11855742 |
| 42 | 12192824 | 26092410 | 2022-04-11 | 2022-04-11 | 2625.0 | M | 41.0 | NaN | 7867151 | 11817631 |
| 43 | 22167265 | 15123721 | 2021-10-18 | 2021-10-18 | 3778.0 | M | NaN | NaN | 11384809 | 11384810 |
| 44 | 23403248 | 28743548 | 2020-11-30 | 2020-11-30 | 1124.0 | M | NaN | NaN | 10642691 | 10642692 |
| 45 | 1590647 | 30915560 | 2021-02-21 | 2021-02-21 | 1543.0 | F | NaN | NaN | 10901412 | 10901412 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 677949 | 3740508 | 48317500 | 2020-08-28 | 2020-08-30 | 1437.0 | F | NaN | NaN | 10435269 | 10435269 |
| 677950 | 22859237 | 64797963 | 2020-08-10 | 2020-08-10 | 1712.0 | M | 42.0 | NaN | 10399071 | 10399071 |
| 677951 | 23036258 | 22989040 | 2020-09-15 | 2020-09-16 | 2380.0 | M | NaN | NaN | 10466130 | 10466130 |
| 677952 | 10708737 | 78485771 | 2021-02-16 | 2021-02-16 | 2357.0 | M | 39.0 | NaN | 10893648 | 10893649 |
| 677954 | 18101292 | 51603645 | 2020-08-22 | 2020-08-23 | 3545.0 | M | NaN | NaN | 10422395 | 10412311 |
545821 rows × 10 columns
# Changing email address key column as customer
jewelry_df = jewelry_df.rename(columns = {'email_address_key' : 'Customer'})
# Taking only columns that we need
jewelry_df = jewelry_df[['Customer', 'submit_date', 'basket_id', 'usd_after_disc_sale_amount']]
# Sorting in descending order of Submit Date, so each customer's first row is their latest order
jewelry_df = jewelry_df.sort_values(by = ['submit_date'], ascending = False)
jewelry_df
| Customer | submit_date | basket_id | usd_after_disc_sale_amount | |
|---|---|---|---|---|
| 627090 | 31441046 | 2022-08-03 | 21152111 | 7982.10 |
| 441805 | 24304301 | 2022-08-03 | 35990748 | 1500.00 |
| 276327 | 31490604 | 2022-08-03 | 11192980 | 30.00 |
| 516607 | 31490214 | 2022-08-03 | 67543863 | 841.50 |
| 400545 | 31490605 | 2022-08-03 | 83973282 | 1352.00 |
| ... | ... | ... | ... | ... |
| 230121 | 22838650 | 2020-08-01 | 70850136 | 265.00 |
| 362450 | 22822119 | 2020-08-01 | 89567762 | 3946.05 |
| 506077 | 20931609 | 2020-08-01 | 21512752 | 266.00 |
| 531063 | 22838480 | 2020-08-01 | 18121710 | 553.00 |
| 437620 | 2835069 | 2020-08-01 | 44203668 | 1962.00 |
545821 rows × 4 columns
# Counting unique values in each column
jewelry_df.nunique()
Customer                      355655
submit_date                      733
basket_id                     446618
usd_after_disc_sale_amount     54879
dtype: int64
# Grouping at customer level
columns = {'submit_date':'Recency_Date','basket_id':'Frequency', 'usd_after_disc_sale_amount':'Sales_Amount'}
jewelry_rfm = jewelry_df.groupby('Customer').agg({'submit_date':'first','basket_id':'nunique',
'usd_after_disc_sale_amount':'sum'}).rename(columns=columns).round(decimals=2)
jewelry_rfm.reset_index()
| Customer | Recency_Date | Frequency | Sales_Amount | |
|---|---|---|---|---|
| 0 | 692 | 2021-12-16 | 1 | 862.50 |
| 1 | 959 | 2022-04-12 | 1 | 5487.50 |
| 2 | 1084 | 2021-12-19 | 1 | 2242.50 |
| 3 | 1359 | 2021-12-15 | 2 | 1041.00 |
| 4 | 9775 | 2022-05-01 | 1 | 3435.00 |
| ... | ... | ... | ... | ... |
| 355650 | 31490631 | 2022-08-03 | 1 | 355.00 |
| 355651 | 31490634 | 2022-08-03 | 1 | 50.29 |
| 355652 | 31490635 | 2022-08-03 | 1 | 203.00 |
| 355653 | 31490636 | 2022-08-03 | 1 | 1032.00 |
| 355654 | 31490637 | 2022-08-03 | 1 | 160.00 |
355655 rows × 4 columns
# Getting most recent ordered date
reference_date = jewelry_rfm['Recency_Date'].max() + pd.DateOffset(days=1)
reference_date
Timestamp('2022-08-04 00:00:00')
# Creating Recency column in days format
jewelry_rfm['Recency'] = jewelry_rfm['Recency_Date'].apply(lambda row: (reference_date - row).days)
jewelry_rfm.drop('Recency_Date', inplace = True, axis = 1)
jewelry_rfm = jewelry_rfm[['Recency', 'Frequency', 'Sales_Amount']]
jewelry_rfm.reset_index()
| Customer | Recency | Frequency | Sales_Amount | |
|---|---|---|---|---|
| 0 | 692 | 231 | 1 | 862.50 |
| 1 | 959 | 114 | 1 | 5487.50 |
| 2 | 1084 | 228 | 1 | 2242.50 |
| 3 | 1359 | 232 | 2 | 1041.00 |
| 4 | 9775 | 95 | 1 | 3435.00 |
| ... | ... | ... | ... | ... |
| 355650 | 31490631 | 1 | 1 | 355.00 |
| 355651 | 31490634 | 1 | 1 | 50.29 |
| 355652 | 31490635 | 1 | 1 | 203.00 |
| 355653 | 31490636 | 1 | 1 | 1032.00 |
| 355654 | 31490637 | 1 | 1 | 160.00 |
355655 rows × 4 columns
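The row-wise lambda used to compute Recency can also be vectorized: subtracting a datetime column from a `Timestamp` yields a timedelta Series, and `.dt.days` extracts whole days. A minimal sketch on toy dates (not the notebook's data):

```python
import pandas as pd

# Two example purchase dates and a fixed reference date
dates = pd.to_datetime(pd.Series(['2022-08-01', '2022-07-04']))
reference_date = pd.Timestamp('2022-08-04')

# Vectorized equivalent of: dates.apply(lambda row: (reference_date - row).days)
recency_days = (reference_date - dates).dt.days
```

On large frames the vectorized form avoids a Python-level call per row, which matters at the ~355k-customer scale here.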
# Creating Monetary column
jewelry_rfm['Monetary'] = (jewelry_rfm['Sales_Amount']/jewelry_rfm['Frequency']).round(decimals=2)
# Dropping Sales Amount column
jewelry_rfm = jewelry_rfm.drop('Sales_Amount', axis = 1)
jewelry_rfm.reset_index(inplace=True)
jewelry_rfm.head()
| Customer | Recency | Frequency | Monetary | |
|---|---|---|---|---|
| 0 | 692 | 231 | 1 | 862.5 |
| 1 | 959 | 114 | 1 | 5487.5 |
| 2 | 1084 | 228 | 1 | 2242.5 |
| 3 | 1359 | 232 | 2 | 520.5 |
| 4 | 9775 | 95 | 1 | 3435.0 |
# Checking Recency Range
a, b = jewelry_rfm.Recency.min(), jewelry_rfm.Recency.max()
a, b
(1, 733)
# Checking Frequency Range
c, d = jewelry_rfm.Frequency.min(), jewelry_rfm.Frequency.max()
c, d
(1, 1787)
# Checking maximum Monetary
e = jewelry_rfm.Monetary.max()
e
340000.01
# Creating bins to check distribution
rec_range = [0, 90, 180, 270, 360, 450, 540, 630, 740]
rec_labels = ['<3','3-6','6-9','9-12','12-15','15-18','18-21','21+']
freq_range = [0, 1, 2, 5, 15, 50, 1800]
freq_labels = ['1','2','3-5','6-15', '16-50','>50']
money_range = [0, 100, 300, 500, 1000, 3000, 5000, 10000, 20000, 50000, 70000, 100000, 400000]
money_labels = ['<100','100-300','300-500','500-1k','1-3k','3-5k', '5-10k','10-20k','20-50k','50-70k','70k-1lc', '>1lc']
jewelry_rfm['R_Labels'] = pd.cut(jewelry_rfm['Recency'], bins = rec_range, labels = rec_labels)
jewelry_rfm['F_Labels'] = pd.cut(jewelry_rfm['Frequency'], bins = freq_range, labels = freq_labels)
jewelry_rfm['M_Labels'] = pd.cut(jewelry_rfm['Monetary'], bins = money_range, labels = money_labels)
jewelry_rfm.head()
| Customer | Recency | Frequency | Monetary | R_Labels | F_Labels | M_Labels | |
|---|---|---|---|---|---|---|---|
| 0 | 692 | 231 | 1 | 862.5 | 6-9 | 1 | 500-1k |
| 1 | 959 | 114 | 1 | 5487.5 | 3-6 | 1 | 5-10k |
| 2 | 1084 | 228 | 1 | 2242.5 | 6-9 | 1 | 1-3k |
| 3 | 1359 | 232 | 2 | 520.5 | 6-9 | 2 | 500-1k |
| 4 | 9775 | 95 | 1 | 3435.0 | 3-6 | 1 | 3-5k |
# Checking the Distribution of the segmentations
fig = plt.figure(figsize=(14,14))
# Recency Plot
ax1 = plt.subplot2grid((2,3),(0,0))
ax = jewelry_rfm['R_Labels'].value_counts().plot(kind='bar', width = 0.4, edgecolor = None)
plt.title('Recency Bins Distribution', y = 1)
for p in ax.patches:
width = p.get_width()
height = p.get_height()
x, y = p.get_xy()
ax.annotate(f'{height}', (x + width/2, y + height*1.02), ha='center')
# Frequency Plot
ax1 = plt.subplot2grid((2,3), (0,1))
ax = jewelry_rfm['F_Labels'].value_counts().plot(kind='bar', width = 0.4, edgecolor = None)
plt.title('Frequency Bins Distribution', y = 1)
for p in ax.patches:
width = p.get_width()
height = p.get_height()
x, y = p.get_xy()
ax.annotate(f'{height}', (x + width/2, y + height*1.02), ha='center')
# Monetary Plot
ax1 = plt.subplot2grid((2,3), (0,2))
ax = jewelry_rfm['M_Labels'].value_counts().plot(kind='bar', width = 0.4, edgecolor = None)
plt.title('Monetary Bins Distribution', y = 1)
for p in ax.patches:
width = p.get_width()
height = p.get_height()
x, y = p.get_xy()
ax.annotate(f'{height}', (x + width/2, y + height*1.02), ha='center')
fig.tight_layout()
plt.show()
# Getting the number of customers over different Bins Distribution
a = ((jewelry_rfm['R_Labels'].value_counts(normalize = True) * 100).round(decimals=2)).to_dict()
b = ((jewelry_rfm['F_Labels'].value_counts(normalize = True) * 100).round(decimals=2)).to_dict()
c = ((jewelry_rfm['M_Labels'].value_counts(normalize = True) * 100).round(decimals=2)).to_dict()
a, b, c
({'18-21': 19.1,
'6-9': 18.63,
'21+': 11.78,
'3-6': 11.11,
'15-18': 10.94,
'9-12': 10.4,
'12-15': 9.05,
'<3': 8.98},
{'1': 83.25,
'2': 12.38,
'3-5': 3.98,
'6-15': 0.36,
'16-50': 0.02,
'>50': 0.0},
{'500-1k': 22.48,
'100-300': 22.29,
'1-3k': 21.08,
'300-500': 16.02,
'<100': 12.23,
'3-5k': 3.2,
'5-10k': 1.87,
'10-20k': 0.61,
'20-50k': 0.19,
'50-70k': 0.01,
'70k-1lc': 0.0,
'>1lc': 0.0})
# Dropping Label columns of R, F and M as we no longer need them
jewelry_rfm = jewelry_rfm.drop(['R_Labels','F_Labels','M_Labels'],axis = 1)
jewelry_rfm.head()
| Customer | Recency | Frequency | Monetary | |
|---|---|---|---|---|
| 0 | 692 | 231 | 1 | 862.5 |
| 1 | 959 | 114 | 1 | 5487.5 |
| 2 | 1084 | 228 | 1 | 2242.5 |
| 3 | 1359 | 232 | 2 | 520.5 |
| 4 | 9775 | 95 | 1 | 3435.0 |
# Function to create R value
def Rval(x):
if x <= 180:
return 4
elif x <= 360:
return 3
elif x <= 540:
return 2
else:
return 1
# Function to create F value
def Fval(x):
if x <= 1:
return 1
elif x <= 4:
return 2
else:
return 3
# Function to create M value
def Mval(x):
if x <= 300:
return 1
elif x <= 750:
return 2
elif x <= 3500:
return 3
elif x <= 20000:
return 4
else:
return 5
# Adding R, F and M segment value columns to the existing dataset
jewelry_rfm['R'] = jewelry_rfm['Recency'].apply(Rval)
jewelry_rfm['F'] = jewelry_rfm['Frequency'].apply(Fval)
jewelry_rfm['M'] = jewelry_rfm['Monetary'].apply(Mval)
jewelry_rfm.head()
| Customer | Recency | Frequency | Monetary | R | F | M | |
|---|---|---|---|---|---|---|---|
| 0 | 692 | 231 | 1 | 862.5 | 3 | 1 | 3 |
| 1 | 959 | 114 | 1 | 5487.5 | 4 | 1 | 4 |
| 2 | 1084 | 228 | 1 | 2242.5 | 3 | 1 | 3 |
| 3 | 1359 | 232 | 2 | 520.5 | 3 | 2 | 2 |
| 4 | 9775 | 95 | 1 | 3435.0 | 4 | 1 | 3 |
# Checking the Distribution of the segmentations
fig = plt.figure(figsize=(12,12))
# Recency Plot
ax1 = plt.subplot2grid((2,3),(0,0))
ax = jewelry_rfm['R'].value_counts().plot(kind='bar', width = 0.4, edgecolor = None)
plt.title('Recency Distribution', y = 1)
for p in ax.patches:
width = p.get_width()
height = p.get_height()
x, y = p.get_xy()
ax.annotate(f'{height}', (x + width/2, y + height*1.02), ha='center')
# Frequency Plot
ax1 = plt.subplot2grid((2,3), (0,1))
ax = jewelry_rfm['F'].value_counts().plot(kind='bar', width = 0.4, edgecolor = None)
plt.title('Frequency Distribution', y = 1)
for p in ax.patches:
width = p.get_width()
height = p.get_height()
x, y = p.get_xy()
ax.annotate(f'{height}', (x + width/2, y + height*1.02), ha='center')
# Monetary Plot
ax1 = plt.subplot2grid((2,3), (0,2))
ax = jewelry_rfm['M'].value_counts().plot(kind='bar', width = 0.4, edgecolor = None)
plt.title('Monetary Distribution', y = 1)
for p in ax.patches:
width = p.get_width()
height = p.get_height()
x, y = p.get_xy()
ax.annotate(f'{height}', (x + width/2, y + height*1.02), ha='center')
fig.tight_layout()
plt.show()
# Getting the number of customers over different final Bins Distribution
a = ((jewelry_rfm['R'].value_counts(normalize = True) * 100).round(decimals=2)).to_dict()
b = ((jewelry_rfm['F'].value_counts(normalize = True) * 100).round(decimals=2)).to_dict()
c = ((jewelry_rfm['M'].value_counts(normalize = True) * 100).round(decimals=2)).to_dict()
sorted(a.items()), sorted(b.items()), sorted(c.items())
([(1, 30.88), (2, 19.99), (3, 29.03), (4, 20.09)], [(1, 83.25), (2, 16.02), (3, 0.73)], [(1, 34.52), (2, 29.61), (3, 31.14), (4, 4.52), (5, 0.21)])
# Concatenating the 3 values as the final RFM column
jewelry_rfm['RFM'] = jewelry_rfm.R.map(str) + jewelry_rfm.F.map(str) + jewelry_rfm.M.map(str)
jewelry_rfm
| Customer | Recency | Frequency | Monetary | R | F | M | RFM | |
|---|---|---|---|---|---|---|---|---|
| 0 | 692 | 231 | 1 | 862.50 | 3 | 1 | 3 | 313 |
| 1 | 959 | 114 | 1 | 5487.50 | 4 | 1 | 4 | 414 |
| 2 | 1084 | 228 | 1 | 2242.50 | 3 | 1 | 3 | 313 |
| 3 | 1359 | 232 | 2 | 520.50 | 3 | 2 | 2 | 322 |
| 4 | 9775 | 95 | 1 | 3435.00 | 4 | 1 | 3 | 413 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 355650 | 31490631 | 1 | 1 | 355.00 | 4 | 1 | 2 | 412 |
| 355651 | 31490634 | 1 | 1 | 50.29 | 4 | 1 | 1 | 411 |
| 355652 | 31490635 | 1 | 1 | 203.00 | 4 | 1 | 1 | 411 |
| 355653 | 31490636 | 1 | 1 | 1032.00 | 4 | 1 | 3 | 413 |
| 355654 | 31490637 | 1 | 1 | 160.00 | 4 | 1 | 1 | 411 |
355655 rows × 8 columns
# Unique combinations of RFM column
jwl = jewelry_rfm.RFM.unique()
print(sorted(jwl))
['111', '112', '113', '114', '115', '121', '122', '123', '124', '125', '131', '132', '133', '134', '211', '212', '213', '214', '215', '221', '222', '223', '224', '225', '231', '232', '233', '234', '235', '311', '312', '313', '314', '315', '321', '322', '323', '324', '325', '331', '332', '333', '334', '335', '411', '412', '413', '414', '415', '421', '422', '423', '424', '425', '431', '432', '433', '434']
# Total Distinct Customer categories we got through RFM
jewelry_rfm.RFM.nunique()
58
We now have 58 unique RFM combinations; next we bucket them into one of the segments below so the marketing team can target them.
# Tagging Customers
jewelry_rfm['Segment'] = np.where(jewelry_rfm['RFM'].isin(['425','415','335','325','315','235','225','215','125','115']),
'VIP',
np.where(jewelry_rfm['RFM'].isin(['434','433','424','423','334','333','324','323']), 'Best',
np.where(jewelry_rfm['RFM'].isin(['414','413','314','313','234','233','224','223','214','213','134','133','124',
'123','114','113']), 'Potential',
np.where(jewelry_rfm['RFM'].isin(['432','431','422','421','332','331','322','321']), 'Promising',
np.where(jewelry_rfm['RFM'].isin(['232','231','222','221','212','211']), 'Winback',
np.where(jewelry_rfm['RFM'].isin(['132','131','122','121','112','111']), 'Churned',
np.where(jewelry_rfm['RFM'].isin(['412','411','312','311']), 'Rookies','Others')))))))
jewelry_rfm.head()
| Customer | Recency | Frequency | Monetary | R | F | M | RFM | Segment | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 692 | 231 | 1 | 862.5 | 3 | 1 | 3 | 313 | Potential |
| 1 | 959 | 114 | 1 | 5487.5 | 4 | 1 | 4 | 414 | Potential |
| 2 | 1084 | 228 | 1 | 2242.5 | 3 | 1 | 3 | 313 | Potential |
| 3 | 1359 | 232 | 2 | 520.5 | 3 | 2 | 2 | 322 | Promising |
| 4 | 9775 | 95 | 1 | 3435.0 | 4 | 1 | 3 | 413 | Potential |
# Plotting the Segmentation
jwl_seg = jewelry_rfm.groupby(['Segment'])['Customer'].count()
ax = jwl_seg.plot(kind='bar', figsize = (6,6), width = 0.5, edgecolor = None)
for p in ax.patches:
width = p.get_width()
height = p.get_height()
x, y = p.get_xy()
ax.annotate(f'{height}', (x + width/2, y + height*1.02), ha='center')
# Percentage of customers among different segments
(jewelry_rfm['Segment'].value_counts(normalize = True) * 100).round(decimals=2)
Potential    32.43
Rookies      26.05
Churned      20.88
Winback      12.39
Promising     4.81
Best          3.23
VIP           0.21
Name: Segment, dtype: float64
We divided the customers into seven segments, with Potential customers again forming the largest group at about 32% of the customer base.